Abstract

This paper develops and implements a practical simulation-based method for estimating 
dynamic discrete choice models. The method, which can accommodate lagged dependent 
variables, serially correlated errors, unobserved variables, and many alternatives, builds on 
the ideas of indirect inference. The main difficulty in implementing indirect inference in 
discrete choice models is that the objective surface is a step function, rendering gradient- 
based optimization methods useless. To overcome this obstacle, this paper shows how to 
smooth the objective surface. The key idea is to use a smoothed function of the latent 
utilities as the dependent variable in the auxiliary model. As the smoothing parameter goes 
to zero, this function delivers the discrete choice implied by the latent utilities, thereby 
guaranteeing consistency. We establish conditions on the smoothing such that our estimator 
enjoys the same limiting distribution as the indirect inference estimator, while at the same 
time ensuring that the smoothing facilitates the convergence of gradient-based optimization 
methods. A set of Monte Carlo experiments shows that the method is fast, robust, and 
nearly as efficient as maximum likelihood when the auxiliary model is sufficiently rich. 


Note. An earlier version of this paper was circulated as the unpublished manuscript Keane 
and Smith (2003). That paper proposed the method of generalized indirect inference (GII), 
but did not formally analyze its asymptotic or computational properties. The present work, 
under the same title but with two additional authors (Bruins and Duffy), rigorously establishes 
the asymptotic and computational properties of GII. It is thus intended to subsume the 2003 
manuscript. Notably, the availability of the 2003 manuscript allowed GII to be used in numerous 
applied studies (see Section 3.3), even though the statistical foundations of the method had not 
been firmly established. The present paper provides these foundations and fills this gap in the 
literature. 
1 Introduction

Many economic models have the features that (i) given knowledge of the model parameters, it is 
easy to simulate data from the model, but (ii) estimation of the model parameters is extremely 
difficult. Models with discrete outcomes or mixed discrete/continuous outcomes commonly fall 
into this category. A good example is the multinomial probit (MNP), in which an agent chooses 
from among several discrete alternatives the one with the highest utility. Simulation of data 
from the model is trivial: simply draw utilities for each alternative, and assign to each agent the 
alternative that gives them the greatest utility. But estimation of the MNP, via either maximum 
likelihood (ML) or the method of moments (MOM), is quite difficult. 
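
The simulation step described above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the function name `simulate_choices`, the mean utilities, and the use of independent standard normal errors (a general MNP would allow correlated errors) are all assumptions made for the example.

```python
import random

def simulate_choices(n_agents, mean_utils, seed=0):
    """Draw a utility for each alternative and record each agent's best alternative.

    mean_utils[j] is a hypothetical deterministic utility for alternative j;
    the errors are independent N(0, 1) draws, a simplifying assumption.
    """
    rng = random.Random(seed)
    choices = []
    for _ in range(n_agents):
        utils = [m + rng.gauss(0.0, 1.0) for m in mean_utils]
        # The agent picks the alternative with the highest utility.
        choices.append(max(range(len(utils)), key=utils.__getitem__))
    return choices

choices = simulate_choices(1000, [0.5, 0.0, -0.5])
shares = [choices.count(j) / len(choices) for j in range(3)]
```

Nothing in this step requires evaluating a choice probability, which is why simulation is cheap even when estimation is not.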

The source of the difficulty in estimating the MNP, as with many other discrete choice models,
is that, from the perspective of the econometrician, the probability an agent chooses a particular
alternative is a high-dimensional integral over multiple stochastic terms (unobserved by the
econometrician) that affect utilities the agent assigns to each alternative. These probability
expressions must be evaluated many times in order to estimate the model by ML or MOM. For
many years econometricians worked on developing fast simulation methods to evaluate choice
probabilities in discrete choice models (see Lerman and Manski, 1981). It was only with the
development of fast and accurate smooth probability simulators that ML or MOM-based estimation
in these models became practical (see McFadden, 1989, and Keane, 1994).

A different approach to inference in discrete choice models is the method of “indirect inference.”
This approach (see Smith, 1990, 1993; Gourieroux, Monfort, and Renault, 1993; Gallant
and Tauchen, 1996) circumvents the need to construct the choice probabilities generated by the
economic model, because it is not based on forming the likelihood or forming moments based on
choice frequencies. Rather, the idea of indirect inference (II) is to choose a statistical model that
provides a rich description of the patterns in the data. This descriptive model is estimated on
both the actual observed data and on simulated data from the economic model. Letting $\beta$ denote
the vector of parameters of the structural economic model, the II estimator is that $\beta$ which
makes the simulated data “look like” the actual data, in the sense (defined formally below) that
the descriptive statistical model estimated on the simulated data “looks like” that same model
estimated on the actual data. (The method of moments is thus a special case of II, in which the
descriptive statistical model corresponds to a vector of moments.)

Indirect inference holds out the promise that it should be practical to estimate any economic 
model from which it is practical to simulate data, even if construction of the likelihood or 
population moments implied by the model is very difficult or impossible. But this promise 
has not been fully realized because of limitations in the II procedure itself. It is very difficult 
to apply II to models that include discrete (or discrete/continuous) outcomes for the following 
reason: small changes in the structural parameters of such models will, in general, cause the data 
simulated from the model to change discretely. Such a discrete change causes the parameters 
of a descriptive model fit to the simulated data to jump discretely, and these discontinuities are 
inherited by the criterion function minimized by the II estimator. 

Thus, given discrete (or discrete/continuous) outcomes, the II estimator cannot be implemented
using gradient-based optimization methods. One instead faces the difficult computational
task of optimizing a $d_\beta$-dimensional step function using much slower derivative-free methods.
This is very time-consuming and puts severe constraints on the size of the structural models
that can be feasibly estimated. Furthermore, even if estimates can be obtained, one does not
have derivatives available for calculating standard errors.

In this paper we propose a “generalized indirect inference” (GII) procedure to address this 
important problem (Sections 3 and 4). The key idea is to generalize the original II method by 
applying two different descriptive statistical models to the simulated and actual data. As long 
as the two descriptive models share the same vector of pseudo-true parameter values (at least 
asymptotically), the GII estimator based on minimizing the distance between the two models is 
consistent, and will enjoy the same asymptotic distribution as the II estimator. 

While the GII idea has wider applicability, here we focus on how it can be used to resolve the 
problem of non-smooth objective functions of II estimators in the case of discrete choice models. 
Specifically, the model we apply to the simulated data does not fit the discrete outcomes in 
that data. Rather, it fits a “smoothed” version of the simulated data, in which discrete choice 
indicators are replaced by smooth functions of the underlying continuous latent variables that 
determine the model’s discrete outcomes. In contrast, the model we apply to the actual data is 
fit to observed discrete choices (obviously, the underlying latent variables that generate actual 
agents’ observed choices are not seen by the econometrician). 

As the latent variables that enter the descriptive model applied to the simulated data are 
smooth functions of the model parameters, the non-smooth objective function problem is obviously
resolved. However, it remains to show that the GII estimator based on minimizing the
distance between these two models is consistent and asymptotically normal. We show that, under 
certain conditions on the parameter regulating the smoothing, the GII estimator has the same 
limiting distribution as the II estimator, permitting inferences to be drawn in the usual manner 
(Section 5). 

Our theoretical analysis goes well beyond merely deriving the limiting distribution of the
minimizer of the GII criterion function. Rather, in keeping with the computational motivation
of this paper, we show that the proposed smoothing facilitates the convergence of derivative-based
optimizers, in the sense that the smoothing leads to a sample optimization problem that
is no more difficult than the corresponding population problem, where the latter involves the
minimization of a necessarily smooth criterion (Section 5). We also provide a detailed analysis of
the convergence properties of selected line-search and trust-region methods. Our results on the
convergence of these derivative-based optimizers seem to be new to the literature. (While our
work here is in some respects related to the theory of $k$-step estimators, we depart significantly
from that literature, for example by dropping the usual requirement that the optimizations
commence from the starting values provided by some consistent initial estimator.)

Finally, we provide Monte Carlo evidence indicating that the GII procedure performs well on
a set of example models (Section 6). We look at some cases where simulated maximum likelihood 
(SML) is also feasible, and show that efficiency losses relative to SML are small. We also show 
how judicious choice of the descriptive (or auxiliary) model is very important for the efficiency 
of the estimator. This is true not only here, but for II more generally. 

Proofs of the theoretical results stated in the paper are given in Appendices B-E. An index 
of key notation appears in Appendix F. All limits are taken as $n \to \infty$.
2 The model

We first describe a class of discrete choice models that we shall use as test cases for the estimation 
method that we develop in this paper. As will become clear, however, the ideas underlying the 
method could be applied to almost any conceivable model of discrete choice, including models 
with mixed discrete/continuous outcomes, and even models in which individuals’ choices solve 
forward-looking dynamic programming problems. 

We henceforth focus mainly on panel data models with n individuals, each of whom selects
a choice from a set of J discrete alternatives in each of T time periods. Let $u_{itj}$ be the (latent)
utility that individual i attaches to alternative j in period t. Without loss of generality, set the
utility of alternative J in any period equal to 0. In each period, each individual chooses the
alternative with the highest utility. Let $y_{itj}$ be equal to 1 if individual i chooses alternative j in
period t and be equal to 0 otherwise. Define $u_{it} := (u_{it1}, \ldots, u_{it,J-1})$ and $y_{it} := (y_{it1}, \ldots, y_{it,J-1})$.
The econometrician observes the choices $\{y_{it}\}$ but not the latent utilities $\{u_{it}\}$.

The vector of latent utilities $u_{it}$ is assumed to follow a stochastic process

$$u_{it} = f(x_{it}, y_{i,t-1}, \ldots, y_{i,t-l}, \epsilon_{it}; \beta), \qquad t = 1, \ldots, T, \tag{2.1}$$

where $x_{it}$ is a vector of exogenous variables.¹ For each individual i, the vector of disturbances
$\epsilon_{it} := (\epsilon_{it1}, \ldots, \epsilon_{it,J-1})$ follows a Markov process $\epsilon_{it} = g(\epsilon_{i,t-1}, \eta_{it}; \beta)$, where $\{\eta_{it}\}_{t=1}^{T}$ is a sequence
of i.i.d. random vectors (of dimension J − 1) having a specified distribution (which does not
depend on $\beta$). The functions f and g depend on a set of k structural parameters $\beta \in B$. The
sequences $\{\eta_{it}\}_{t=1}^{T}$, i = 1, ..., n, are independent across individuals and independent of $x_{it}$ for
all i and t. The initial values $\epsilon_{i0}$ and $y_{it}$, t = 0, −1, ..., 1 − l, are fixed exogenously.

Although the estimation method proposed in this paper can (in principle) be applied to 
any model of this form, we focus on four special cases of the general model. Three of these 
cases (Models 1, 2, and 4 below) can be feasibly estimated using simulated maximum likelihood, 
allowing us to compare its performance with that of the proposed method. 

Model 1. J = 2, T > 1, and $u_{it} = b x_{it} + \epsilon_{it}$, where $x_{it}$ is a scalar, $\epsilon_{it} = r\epsilon_{i,t-1} + \eta_{it}$, $\eta_{it} \sim$ i.i.d.
$N[0,1]$, and $\epsilon_{i0} = 0$. This is a two-alternative dynamic probit model with serially correlated
errors; it has two unknown parameters b and r.

Model 2. J = 2, T > 1, and $u_{it} = b_1 x_{it} + b_2 y_{i,t-1} + \epsilon_{it}$, where $x_{it}$ is a scalar and $\epsilon_{it}$ follows
the same process as in Model 1. The initial value $y_{i0}$ is set equal to 0. This is a two-alternative
dynamic probit model with serially correlated errors and a lagged dependent variable; it has
three unknown parameters $b_1$, $b_2$, and r.

Model 3. Identical to Model 2 except that the econometrician does not observe the first s < T
of the individual’s choices. Thus there is an “initial conditions” problem (see Heckman, 1981).

¹The estimation method proposed in this paper can also accommodate models in which the latent utilities in
any given period depend on lagged values of the latent utilities.
Model 4. J = 3, T = 1, and the latent utilities obey:

$$u_{i1} = b_{10} + b_{11}x_{i1} + b_{12}x_{i2} + \eta_{i1}$$
$$u_{i2} = b_{20} + b_{21}x_{i1} + b_{22}x_{i3} + c_1\eta_{i1} + c_2\eta_{i2},$$

where $(\eta_{i1}, \eta_{i2}) \sim N[0, I_2]$. (Since T = 1 in this model, the time subscript has been omitted.)
This is a static three-alternative probit model; it has eight unknown parameters $\{b_{1k}\}_{k=0}^{2}$,
$\{b_{2k}\}_{k=0}^{2}$, $c_1$, and $c_2$.

The techniques developed in this paper may also be applied to models with a mixture of 
discrete and continuous outcomes. A leading example is the Heckman selection model: 

Model 5. A selection model with two equations: the first equation determines an individual’s
wage and the second determines his/her latent utility from working:

$$w_i = b_{10} + b_{11}x_{1i} + c_1\eta_{i1} + c_2\eta_{i2}$$
$$u_i = b_{20} + b_{21}x_{2i} + b_{22}w_i + \eta_{i2}.$$

Here $x_{1i}$ and $x_{2i}$ are exogenous regressors and $(\eta_{i1}, \eta_{i2}) \sim N[0, I_2]$. The unknown parameters
are $\{b_{1k}\}_{k=0}^{1}$, $\{b_{2k}\}_{k=0}^{2}$, $c_1$, and $c_2$. Let $y_i := 1\{u_i > 0\}$ be an indicator for employment status.
The econometrician observes the outcome $y_i$ but not the latent utility $u_i$. In addition, the
econometrician observes a person’s wage $w_i$ if and only if he/she works (i.e. if $y_i = 1$).
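
To illustrate how cheaply the models of this section can be simulated, the following Python sketch generates a panel from Model 2 (setting `b2 = 0` gives Model 1). It is an illustration only: the function name `simulate_model2` and the choice of standard normal regressors are assumptions made for the example, not specifications taken from the paper.

```python
import random

def simulate_model2(n, T, b1, b2, r, seed=0):
    """Simulate choices from u_it = b1*x_it + b2*y_{i,t-1} + eps_it,
    with eps_it = r*eps_{i,t-1} + eta_it, eta_it ~ N(0, 1), eps_i0 = 0, y_i0 = 0.

    The regressors x_it are drawn as N(0, 1); this is an assumption of the sketch.
    """
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(T)] for _ in range(n)]
    y = []
    for i in range(n):
        eps_prev, y_prev, row = 0.0, 0, []
        for t in range(T):
            eps = r * eps_prev + rng.gauss(0.0, 1.0)   # AR(1) disturbance
            u = b1 * x[i][t] + b2 * y_prev + eps       # latent utility
            y_t = 1 if u > 0 else 0                    # observed discrete choice
            row.append(y_t)
            eps_prev, y_prev = eps, y_t
        y.append(row)
    return x, y

x, y = simulate_model2(n=50, T=8, b1=1.0, b2=0.5, r=0.4, seed=7)
```

Holding `seed` fixed reproduces the same underlying draws for every parameter value, which is the convention used when simulating data for indirect inference.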

3 Generalized indirect inference 

We propose to estimate the model in Section 2 via a generalization of indirect inference. First, 
in Section 3.1 we exposit the method of indirect inference as originally formulated. In Section 
3.2 we explain the difficulty of applying the original approach to discrete choice models. Then, 
Section 3.3 presents our generalized indirect inference estimator that resolves this difficulty. 

3.1 Indirect inference 

Indirect inference exploits the ease and speed with which one can typically simulate data from 
even complex structural models. The basic idea is to view both the observed data and the 
simulated data through the “lens” of a descriptive statistical (or auxiliary) model characterized 
by a set of $d_\theta$ auxiliary parameters $\theta$. The $d_\beta \le d_\theta$ structural parameters $\beta$ are then chosen so as
to make the observed data and the simulated data look similar when viewed through this lens.

To formalize these ideas, assume the observed choices $\{y_{it}\}$, i = 1, ..., n, t = 1, ..., T, are
generated by the structural discrete choice model described in (2.1), for a given value $\beta_0$ of the
structural parameters. An auxiliary model can be estimated using the observed data to obtain
parameter estimates $\hat\theta_n$. Formally, $\hat\theta_n$ solves:

$$\hat\theta_n := \operatorname*{argmax}_{\theta\in\Theta} \mathcal{L}_n(y, x; \theta) = \operatorname*{argmax}_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, x_i; \theta), \tag{3.1}$$
where $\mathcal{L}_n(y, x; \theta)$ is the average log-likelihood function (or more generally, some statistical
criterion function) associated with the auxiliary model, $y := \{y_{it}\}$ is the set of observed choices, and
$x := \{x_{it}\}$ is the set of observed exogenous variables.

Let $\eta^m := \{\eta^m_{it}\}$ denote a set of simulated draws for the values of the unobservable components
of the model, with these draws being independent across $m \in \{1, \ldots, M\}$. Then given x and
a parameter vector $\beta$, the structural model can be used to generate M corresponding sets of
simulated choices, $y^m(\beta) := \{y^m_{it}(\beta)\}$. (Note that the same values of x and $\{\eta^m\}$ are used for all
$\beta$.) Estimating the auxiliary model on the mth simulated dataset thus yields

$$\hat\theta^m_n(\beta) := \operatorname*{argmax}_{\theta\in\Theta} \mathcal{L}_n(y^m(\beta), x; \theta). \tag{3.2}$$

Let $\bar\theta_n(\beta) := \frac{1}{M}\sum_{m=1}^{M} \hat\theta^m_n(\beta)$ denote the average of these estimates. Under appropriate regularity
conditions, as the observed sample size n grows large (holding M and T fixed), $\bar\theta_n(\beta)$ converges
uniformly in probability to a non-stochastic function $\theta(\beta)$, which Gourieroux, Monfort, and
Renault (1993) term the binding function.

Loosely speaking, indirect inference generates an estimate $\hat\beta_n$ of the structural parameters
by choosing $\beta$ so as to make $\hat\theta_n$ and $\bar\theta_n(\beta)$ as close as possible, with consistency following from
$\hat\theta_n$ and $\bar\theta_n(\beta_0)$ both converging to the same pseudo-true value $\theta_0 := \theta(\beta_0)$. To implement the
estimator we require a formal metric of the distance between $\hat\theta_n$ and $\bar\theta_n(\beta)$. There are three
approaches to choosing such a metric, analogous to the three classical approaches to hypothesis
testing: the Wald, likelihood ratio (LR), and Lagrange multiplier (LM) approaches.²

The Wald approach to indirect inference chooses $\beta$ to minimize the weighted distance between
$\bar\theta_n(\beta)$ and $\hat\theta_n$,

$$Q^{W}_n(\beta) := \|\bar\theta_n(\beta) - \hat\theta_n\|^2_{W_n},$$

where $\|x\|^2_A := x^T A x$, and $W_n$ is a sequence of positive-definite weight matrices.

The LR approach forms a metric implicitly by using the average log-likelihood $\mathcal{L}_n(y, x; \theta)$
associated with the auxiliary model. In particular, it seeks to minimize

$$Q^{LR}_n(\beta) := -\mathcal{L}_n(y, x; \bar\theta_n(\beta)) = -\frac{1}{n}\sum_{i=1}^{n} \ell(y_i, x_i; \bar\theta_n(\beta)).$$

Finally, the LM approach does not work directly with the estimated auxiliary parameters
$\bar\theta_n(\beta)$ but instead uses the score vector associated with the auxiliary model.³ Given the estimated
auxiliary model parameters $\hat\theta_n$ from the observed data, the score vector is evaluated using each of
the M simulated data sets. The LM estimator then minimizes a weighted norm of the average
²This nomenclature is due to Eric Renault. The Wald and LR approaches were first proposed in Smith (1990,
1993) and later extended by Gourieroux, Monfort, and Renault (1993). The LM approach was first proposed in
Gallant and Tauchen (1996).

³When the LM approach is implemented using an auxiliary model that is (nearly) correctly specified in the
sense that it provides a (nearly) correct statistical description of the observed data, Gallant and Tauchen (1996)
refer to this approach as efficient method of moments (EMM).
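score vector across these datasets,

$$Q^{LM}_n(\beta) := \left\| \frac{1}{M}\sum_{m=1}^{M} \dot{\mathcal{L}}_n(y^m(\beta), x; \hat\theta_n) \right\|^2_{V_n},$$

where $\dot{\mathcal{L}}_n$ denotes the gradient of $\mathcal{L}_n$ with respect to $\theta$, and $V_n$ is a sequence of positive-definite
weight matrices.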

All three approaches yield consistent and asymptotically normal estimates of $\beta_0$, and are
first-order asymptotically equivalent in the exactly identified case in which $d_\beta = d_\theta$. In the
over-identified case, when the weight matrices $W_n$ and $V_n$ are chosen optimally (in the sense
of minimizing asymptotic variance) both the Wald and LM estimators are more efficient than 
the LR estimator. However, if the auxiliary model is correctly specified, all three estimators are 
asymptotically equivalent not only to each other but also to maximum likelihood (provided that 
M is sufficiently large). 
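
The Wald approach of this section can be sketched for a toy problem. The Python fragment below is a minimal illustration under stated assumptions, not the paper's implementation: the structural model is a static binary probit u = b·x + η, the auxiliary model is a linear regression of the choice on x (so the auxiliary parameters are an intercept and a slope), the weight matrix is the identity, and all function names are hypothetical.

```python
import random

def ols(y, x):
    """Intercept and slope from regressing y on x (the assumed auxiliary model)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return (my - slope * mx, slope)

def choices(b, x, eta):
    """Discrete choices implied by u = b*x + eta for fixed draws eta."""
    return [1.0 if b * xi + ei > 0 else 0.0 for xi, ei in zip(x, eta)]

def wald_criterion(b, x, theta_hat, sim_draws):
    """Squared distance (identity weights) between the averaged simulated
    auxiliary estimates and the estimates from the observed data."""
    fits = [ols(choices(b, x, eta), x) for eta in sim_draws]
    theta_bar = [sum(f[k] for f in fits) / len(fits) for k in range(2)]
    return sum((tb - th) ** 2 for tb, th in zip(theta_bar, theta_hat))

rng = random.Random(0)
n, M, b_true = 2000, 5, 1.0
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y_obs = choices(b_true, x, [rng.gauss(0.0, 1.0) for _ in range(n)])
theta_hat = ols(y_obs, x)                                   # auxiliary fit, observed data
sim_draws = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(M)]  # held fixed

q_near = wald_criterion(1.0, x, theta_hat, sim_draws)       # evaluated at the true b
q_far = wald_criterion(3.0, x, theta_hat, sim_draws)        # evaluated far from it
```

The simulation draws are generated once and reused for every trial value of b, as required by the definition of the sample binding function.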


3.2 Indirect inference for discrete choice models 

Step functions arise naturally when applying indirect inference to discrete choice models because
any simulated choice $y^m_{it}(\beta)$ is a step function of $\beta$ (holding fixed the set of random draws $\{\eta^m\}$
used to generate simulated data from the structural model). Consequently, the sample binding
function $\bar\theta_n(\beta)$ is discontinuous in $\beta$. Obviously, this discontinuity is inherited by the criterion
functions minimized by the II estimators in Section 3.1.
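
The step-function behaviour is easy to see numerically. The following Python sketch is an illustration only, assuming a static binary model u = b·x + η with standard normal x and η (names and settings are hypothetical): with the draws held fixed, a tiny change in b leaves the simulated choice share unchanged, and over a grid of b values the share only ever takes values that are multiples of 1/n.

```python
import random

def simulated_share(b, x, eta):
    """Share of simulated agents with u = b*x + eta > 0, for fixed draws eta."""
    return sum(1 for xi, ei in zip(x, eta) if b * xi + ei > 0) / len(x)

rng = random.Random(1)
n = 200
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
eta = [rng.gauss(0.0, 1.0) for _ in range(n)]   # held fixed across values of b

base = simulated_share(1.0, x, eta)
tiny = simulated_share(1.0 + 1e-9, x, eta)       # perturbation too small to flip a choice
grid = [simulated_share(0.5 + 0.01 * k, x, eta) for k in range(101)]
distinct_levels = len(set(grid))                 # number of plateaus visited on the grid
```

A finite-difference gradient computed from `base` and `tiny` is zero, which is why a gradient-based optimizer receives no useful information from the unsmoothed criterion.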

Thus, given discrete outcomes, II cannot be implemented using gradient-based optimization 
methods. One must instead rely on derivative-free methods (such as the Nelder-Mead simplex 
method); random search algorithms (such as simulated annealing); or abandon optimization 
altogether, and instead implement a Laplace-type estimator, via Markov Chain Monte Carlo 
(MCMC; see Chernozhukov and Hong, 2003). But convergence of derivative-free methods is often 
very slow; while MCMC, even when it converges, may produce (in finite samples) an estimator 
substantially different from the optimum of the statistical criterion to which it is applied (see 
Kormiltsina and Nekipelov, 2012). Thus, the non-smoothness of the criterion functions that 
define II estimators renders them very difficult to use in the case of discrete data.

Despite the difficulties in applying II to discrete choice models, the appeal of the II approach
has led some authors to push ahead and apply it nonetheless. Some notable papers that apply
II by optimizing non-smooth objective functions are Magnac, Robin, and Visser (1995), An and
Liu (2000), Nagypal (2007), Eisenhauer, Heckman, and Mosso (2015), Li and Zhang (2015) and
Skira (2015). Our work aims to make it much easier to apply II in these and related contexts.

3.3 A smoothed estimator (GII) 

Here we propose a generalization of indirect inference that is far more practical in the context 
of discrete outcomes. The fundamental idea is that the estimation procedures applied to the observed
and simulated data sets need not be identical, provided that they both provide consistent
estimates of the same binding function. (Genton and Ronchetti, 2003, use a similar insight to
develop robust estimation procedures in the context of indirect inference.) We exploit this idea
to smooth the function $\bar\theta_n(\beta)$, obviating the need to optimize a step function when using indirect
inference to estimate a discrete choice model.

Let $u^m_{itj}(\beta)$ denote the latent utility that individual i attaches to alternative $j \in \{1, \ldots, J-1\}$
in period t of the mth simulated data set, given structural parameters $\beta$ (recall that the utility
of the Jth alternative is normalized to 0). Rather than use the simulated choice $y^m_{itj}(\beta)$ when
computing $\bar\theta_n(\beta)$, we propose to replace it by the following smooth function of the latent utilities,

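$$y^m_{itj}(\beta, \lambda) := K_\lambda[u^m_{itj}(\beta) - u^m_{it1}(\beta), \ldots, u^m_{itj}(\beta) - u^m_{itJ}(\beta)],$$

where $K : \mathbb{R}^{J-1} \to \mathbb{R}$ is a smooth, mean-zero multivariate cdf, and $K_\lambda(v) := K(\lambda^{-1}v)$. As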
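the smoothing parameter $\lambda$ goes to 0, the preceding converges to $y^m_{itj}(\beta, 0) = y^m_{itj}(\beta)$. Defining
$\bar\theta_n(\beta, \lambda) := \frac{1}{M}\sum_{m=1}^{M} \hat\theta^m_n(\beta, \lambda)$, where

$$\hat\theta^m_n(\beta, \lambda) := \operatorname*{argmax}_{\theta\in\Theta} \mathcal{L}_n(y^m(\beta, \lambda), x; \theta), \tag{3.3}$$

we may regard $\bar\theta_n(\beta, \lambda)$ as a smoothed estimate of $\theta(\beta)$, for which it is consistent so long as
$\lambda = \lambda_n \to 0$ as $n \to \infty$. Accordingly, an indirect inference estimator based on $\bar\theta_n(\beta, \lambda_n)$, which
we shall henceforth term the generalized indirect inference (GII) estimator, ought to be consistent
for $\beta_0$.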
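
The effect of the smoothing can be sketched for the two-alternative case. The Python fragment below is an illustration under assumptions that are not taken from the paper: K is chosen to be the logistic cdf, the model is a static probit u = b·x + η, and the statistic examined is the sample mean of x·y (a simple moment an auxiliary regression would use). All names are hypothetical.

```python
import math
import random

def K(v, lam):
    """Logistic cdf with scale lam: a smooth stand-in for the indicator 1{v > 0}."""
    z = v / lam
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # z < 0, so exp cannot overflow
    return e / (1.0 + e)

def moment(b, x, eta, lam=None):
    """Sample mean of x*y, with y the hard choice (lam=None) or its smoothed version."""
    total = 0.0
    for xi, ei in zip(x, eta):
        u = b * xi + ei
        y = (1.0 if u > 0 else 0.0) if lam is None else K(u, lam)
        total += xi * y
    return total / len(x)

rng = random.Random(2)
n = 500
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
eta = [rng.gauss(0.0, 1.0) for _ in range(n)]   # held fixed across values of b

hard = moment(1.0, x, eta)
nearly_hard = moment(1.0, x, eta, lam=1e-6)      # very small lam approximates the indicator

h = 1e-7                                         # step for central finite differences
slope_smooth = (moment(1.0 + h, x, eta, lam=0.1) - moment(1.0 - h, x, eta, lam=0.1)) / (2 * h)
slope_hard = (moment(1.0 + h, x, eta) - moment(1.0 - h, x, eta)) / (2 * h)
```

With smoothing the finite-difference slope is strictly positive, whereas the unsmoothed statistic is locally flat; shrinking `lam` trades this regularity against the bias introduced by the smoothing.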

Each of the three approaches to indirect inference can be generalized simply by replacing
each simulated choice $y^m_{it}(\beta)$ with its smoothed counterpart $y^m_{it}(\beta, \lambda_n)$. For the Wald and LR
estimators, this entails using the smoothed sample binding function $\bar\theta_n(\beta, \lambda_n)$ in place of the
unsmoothed estimate $\bar\theta_n(\beta)$. (See Section 4.2 below for the exact forms of the criterion functions.)
The remainder of this paper is devoted to studying the properties of the resulting estimators, 
both analytically (Section 5) and through a series of simulation exercises (Section 6). 

The GII approach was first suggested in an unpublished manuscript by Keane and Smith 
(2003), but they did not derive the asymptotic properties of the estimator. Despite this, GII 
has proven to be popular in practice, and has already been applied in a number of papers, such 
as Gan and Gong (2007), Cassidy (2012), Altonji, Smith, and Vidangos (2013), Morten (2013),
Ypma (2013), Lopez-Mayan (2014) and Lopez Garcia (2015). Given the growing popularity of 
the method, a careful analysis of its asymptotic properties is obviously needed. 

3.4 Related literature 

Our approach to smoothing in a discrete choice model bears a superficial resemblance to that used 
by Horowitz (1992) to develop a smoothed version of Manski’s (1985) maximum score estimator 
for a binary response model. As here, the smooth version of maximum score is constructed by 
replacing discontinuous indicators with smooth cdfs in the sample criterion function. 

However, there is a fundamental difference in the statistical properties of the minimization
problems solved by Manski’s estimator, and the (unsmoothed) indirect inference estimator.
Specifically, $n^{-1/2}$-consistent estimators are available for the unsmoothed problem considered in
this paper (see Theorem 5.1 below, or Pakes and Pollard, 1989); whereas, in the case of Manski’s
(1985) maximum score estimator, only $n^{-1/3}$-consistency is obtained without smoothing (see Kim
and Pollard, 1990), and smoothing yields an estimator with an improved rate of convergence.

A potentially more relevant analogue for the present paper is smoothed quantile regression.
This originates with Horowitz’s (1998) work on the smoothed least absolute deviation estimator, 
extended to more general quantile regression and quantile-IV models by Whang (2006), Otsu 
(2008) and Kaplan and Sun (2012). The latter papers do not smooth the criterion function, 
but rather the estimating equations (approximate first-order conditions) that equivalently define 
the estimator. These first-order conditions involve indicator-type discontinuities like those in our 
problem, smoothed in the same way. Insofar as the problem of solving the estimating equations is 
analogous to the minimum-distance problem solved by the II estimator, the effects of smoothing 
are similar: in each case smoothing (if done appropriately) affects neither the rate of convergence 
nor the limiting distribution of the estimator, relative to its unsmoothed counterpart. 

The motivation for smoothing in the quantile regression case involves the potential for higher-order
asymptotic improvements.⁴ In contrast, in the present setting, which involves structural
models of possibly great complexity, the potential for higher-order improvements is limited.⁵ The
key motivation for smoothing in our case is computational.

Accordingly, much of this paper is devoted to a formal analysis of the potential computational
gains from smoothing. In particular, Sections 5.4-5.6 are devoted to providing a theoretical
foundation for our claim that smoothing facilitates the convergence of standard derivative-based
optimization methods that are widely used to solve (smooth) optimization problems in practice.

For the class of models considered in this paper, two leading alternative estimation methods
that might be considered are simulated maximum likelihood (SML) in conjunction with the
Geweke, Hajivassiliou and Keane (GHK) smooth probability simulator (see Section 4 in Geweke
and Keane, 2001), and the nonparametric simulated maximum likelihood (NPSML) estimator
(Diggle and Gratton, 1984; Fermanian and Salanie, 2004; Kristensen and Shin, 2012). However,
the GHK simulator can only be computed in models possessing a special structure - which is 
true for Models 1, 2 and 4 above, but not for Model 3 - while in models that involve a mixture of 
discrete and continuous outcomes, NPSML may require the calculation of rather high-dimensional 
kernel density estimates in order to construct the likelihood, the accuracy of which may require 
simulating the model a prohibitively large number of times. 

Finally, an alternative approach to smoothing the II estimator is importance sampling, as in
Keane and Sauer (2010) and Sauer and Taber (2013). The basic idea is to simulate data from
the structural model only once (at the initial estimate of $\beta$). One holds these simulated data
fixed as one iterates. Given an updated estimate of $\beta$, one re-weights the original simulated data
points, so those initial simulations that are more (less) likely under the new $\beta$ (than under the
initial $\beta$) get more (less) weight in forming the updated objective function.
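
The re-weighting idea can be sketched on a deliberately simple example. The Python fragment below is not the procedure of Keane and Sauer (2010) or Sauer and Taber (2013); it is a toy illustration in which the "structural model" is a single latent variable z ~ N(μ, 1) with outcome 1{z > 0}, so that the likelihood ratio used as a weight is available in closed form. All names and settings are assumptions made for the example.

```python
import math
import random

def normal_pdf(z, mean):
    """Density of N(mean, 1) evaluated at z."""
    return math.exp(-0.5 * (z - mean) ** 2) / math.sqrt(2.0 * math.pi)

def reweighted_share(mu, mu_init, draws):
    """Estimate P(z > 0) under N(mu, 1) using draws taken once under N(mu_init, 1)."""
    total = 0.0
    for z in draws:
        weight = normal_pdf(z, mu) / normal_pdf(z, mu_init)   # likelihood ratio
        total += weight * (1.0 if z > 0 else 0.0)
    return total / len(draws)

rng = random.Random(3)
mu_init = 0.0
draws = [rng.gauss(mu_init, 1.0) for _ in range(5000)]   # simulated once, then held fixed

share_init = reweighted_share(mu_init, mu_init, draws)    # all weights equal 1 here
share_new = reweighted_share(0.5, mu_init, draws)         # same draws, new parameter
```

Because only the weights depend on the parameter, the resulting estimate is a smooth function of μ even though each simulated outcome is discrete. The practical difficulty noted in the text is that in richer models the weights are not available this easily.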

In our view the GII and importance sampling approaches both have virtues. The main
limitation of the importance sampling approach is that in many models the importance sample 
weights may themselves be computationally difficult to construct. Keane and Sauer (2010), 

⁴While potential computational benefits have been noted in passing, we are not aware of any attempt to
demonstrate these formally, in the manner of Theorems 5.3-5.5 below.

⁵This is particularly evident when the auxiliary model consists of a system of regression equations, as per
Section 4.4 below. For while smoothing does indeed reduce the variability of the simulated (discrete) outcomes
$y^m_{it}(\beta, \lambda)$, this may increase the variance with which some parameters of the auxiliary model are estimated, if $y_{it}$
appears as a regressor in that model, as will be the case for Models 2 and 3 (see Sections 6.2 and 6.3 below). (Note
that any such increase, while certainly possible, is of only second-order importance, and disappears as $\lambda_n \to 0$.)
when working with models similar to those in Section 2, assume that all variables are measured
with error, which gives a structure that implies very simple weights. In many contexts such a
measurement error assumption may be perfectly sensible. But the GII method can be applied 
directly to the models of Section 2 without adding any auxiliary assumptions (or parameters). 

4 Further refinements and the choice of auxiliary model 

4.1 Smoothing in dynamic models 

For models in which latent utilities depend on past choices (as distinct from past utilities, which 
are already smooth), such as Models 2 and 3 above, the performance of GII may be improved 
by making a further adjustment to the smoothing proposed in Section 3.3. The nature of this 
adjustment is best illustrated in terms of the example provided by Model 2. In this case, it is 
clear that setting 

$$y^m_{it}(\beta, \lambda) := K_\lambda[b_1 x_{it} + b_2 y^m_{i,t-1}(\beta) + \epsilon^m_{it}],$$

where $y^m_{i,t-1}(\beta)$ denotes the unsmoothed choice made at date t − 1, will yield unsatisfactory
results, insofar as the $y^m_{it}(\beta, \lambda)$ so constructed will remain discontinuous in $\beta$. To some extent,
this may be remedied by modifying the preceding to

$$y^m_{it}(\beta, \lambda) := K_\lambda[b_1 x_{it} + b_2 y^m_{i,t-1}(\beta, \lambda) + \epsilon^m_{it}], \tag{4.1}$$

with $y^m_{i0}(\beta, \lambda) := 0$, as per the specification of the model. However, while the $y^m_{it}(\beta, \lambda)$’s generated
through this recursion will indeed be smooth (i.e., twice continuously differentiable), the nesting
of successive approximations entailed by (4.1) implies that for large t, the derivatives of $y^m_{it}(\beta, \lambda)$
may be highly irregular unless a relatively large value of $\lambda$ is employed.

This problem may be avoided by instead computing $y^m_{it}(\beta, \lambda)$ as follows. Defining $v^m_{itk}(\beta) :=
b_1 x_{it} + b_2 1\{k = 1\} + \epsilon^m_{it}$, we see that the unsmoothed choices satisfy

$$y^m_{it}(\beta) = 1\{v^m_{it0}(\beta) > 0\} \cdot [1 - y^m_{i,t-1}(\beta)] + 1\{v^m_{it1}(\beta) > 0\} \cdot y^m_{i,t-1}(\beta),$$

which suggests using the following recursion for the smoothed choices,

$$y^m_{it}(\beta, \lambda) := K_\lambda[v^m_{it0}(\beta)] \cdot [1 - y^m_{i,t-1}(\beta, \lambda)] + K_\lambda[v^m_{it1}(\beta)] \cdot y^m_{i,t-1}(\beta, \lambda), \tag{4.2}$$

with $y^m_{i0}(\beta, \lambda) := 0$. This indeed yields a valid approximation to $y^m_{it}(\beta)$, as $\lambda \to 0$. The smoothed
choices computed using (4.2) involve no nested approximations, but merely sums of products
involving terms of the form $K_\lambda[v^m_{itk}(\beta)]$. The derivatives of these are well-behaved with respect
to $\lambda$, even for large t, and are amenable to the theoretical analysis of Section 5.

Nonetheless, we find that even if smoothing is done by simply using (4.1), GII appears to
work well in practice. This will be shown in the simulation exercises reported in Section 6.
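
The two recursions of this section can be compared on a single simulated history. The Python sketch below is illustrative only: it assumes a logistic cdf for K and uses a short hand-picked sequence of regressors and disturbances; the function names are hypothetical.

```python
import math

def K(v, lam):
    """Logistic cdf with scale lam (an assumed choice of smoothing kernel)."""
    z = v / lam
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def hard_choices(b1, b2, x, eps):
    """Unsmoothed Model 2 choices, starting from y_0 = 0."""
    y_prev, path = 0.0, []
    for xt, et in zip(x, eps):
        y_prev = 1.0 if b1 * xt + b2 * y_prev + et > 0 else 0.0
        path.append(y_prev)
    return path

def smooth_nested(b1, b2, x, eps, lam):
    """Recursion (4.1): the smoothed lagged choice enters inside K."""
    y_prev, path = 0.0, []
    for xt, et in zip(x, eps):
        y_prev = K(b1 * xt + b2 * y_prev + et, lam)
        path.append(y_prev)
    return path

def smooth_unnested(b1, b2, x, eps, lam):
    """Recursion (4.2): mix the two smoothed branches, weighting by the smoothed lag."""
    y_prev, path = 0.0, []
    for xt, et in zip(x, eps):
        v0 = b1 * xt + et           # utility index if the lagged choice was 0
        v1 = b1 * xt + b2 + et      # utility index if the lagged choice was 1
        y_prev = K(v0, lam) * (1.0 - y_prev) + K(v1, lam) * y_prev
        path.append(y_prev)
    return path

x = [0.3, -1.2, 0.8, 0.1, -0.4]     # hypothetical regressors for one individual
eps = [0.5, 0.2, -1.0, 0.7, -0.3]   # hypothetical disturbances, held fixed
hard = hard_choices(1.0, 0.5, x, eps)
nested_small = smooth_nested(1.0, 0.5, x, eps, lam=1e-3)
unnested_small = smooth_unnested(1.0, 0.5, x, eps, lam=1e-3)
unnested_wide = smooth_unnested(1.0, 0.5, x, eps, lam=0.5)
```

For small `lam` both recursions reproduce the hard choices; in (4.2) the parameter b2 enters only through `v1`, so K is never applied to a previously smoothed quantity.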
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4.2 Bias reduction via jackknifing 

As we noted in Section 3.3, GII inherits the consistency of the II estimator, provided that $\lambda_n \to 0$
as $n \to \infty$. However, as smoothing necessarily imparts a bias to the sample binding function
$\bar\theta_n(\beta,\lambda_n)$, and thence to the GII estimator, we need $\lambda_n$ to shrink to zero at a sufficiently fast
rate if GII is to enjoy the same limiting distribution as the unsmoothed estimator. On the other
hand, if $\lambda_n \to 0$ too rapidly, derivatives of the GII criterion function will become highly irregular,
impeding the ability of derivative-based optimization routines to locate the minimum.
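The trade-off can be seen in a toy calculation. The sketch below is illustrative only: it assumes a single-index choice rule $1\{\beta x_i + \eta_i \ge 0\}$, a logistic kernel, and made-up data, none of which are taken from the paper. It contrasts a simulated choice frequency, which is flat between jumps as a function of $\beta$, with its smoothed counterpart, which responds to small changes in $\beta$.

```python
import math

def logistic_cdf(v):
    """Numerically stable logistic cdf (assumed kernel)."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def choice_frequency(beta, x, eta):
    """Share of simulated choices equal to one: a step function of beta."""
    return sum(1.0 for xi, ei in zip(x, eta) if beta * xi + ei >= 0) / len(x)

def smoothed_frequency(beta, x, eta, lam):
    """Same statistic with the indicator replaced by K(. / lam)."""
    return sum(logistic_cdf((beta * xi + ei) / lam) for xi, ei in zip(x, eta)) / len(x)

x = [1.0, 2.0, -1.0, 0.5]       # illustrative covariates
eta = [-0.5, 0.3, 0.2, -1.0]    # simulated errors, held fixed across beta
h = 1e-6
# Finite differences: zero for the step function, nonzero for the smoothed one.
print(choice_frequency(1.0 + h, x, eta) - choice_frequency(1.0, x, eta))
print(smoothed_frequency(1.0 + h, x, eta, 0.1) - smoothed_frequency(1.0, x, eta, 0.1))
```

Shrinking `lam` moves the smoothed statistic closer to the step function, which reduces the bias but makes the derivative increasingly spiky.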

Except for certain special cases, the smoothing bias is of the order $\|\theta(\beta_0,\lambda) - \theta(\beta_0,0)\| = O(\lambda)$
and no smaller. Thus, it is only dominated by the estimator variance if $n^{1/2}\lambda_n = o_p(1)$. On the
other hand, it follows from Proposition 5.1 below that $n^{1-1/p_0}\lambda_n^{2l-1} \to \infty$ (cf. (5.7)) is necessary to ensure
that the $l$th order derivatives of the GII criterion function converge, uniformly in probability,
to their population counterparts. Here $p_0 \in (1,\infty]$ depends largely on the order of moments
possessed by the exogenous covariates $x$ (see Assumption L below). Thus, even in the most
favorable case of $p_0 = \infty$, one can only ensure asymptotic negligibility of the bias (relative to
the variance) at the cost of preventing second derivatives of the sample criterion function from
converging to their population counterparts, a convergence that is necessary to ensure the good
performance of at least some derivative-based optimization routines (see Section 5.6 below).

Fortunately, these difficulties can easily be overcome by applying Richardson extrapolation
- commonly referred to as “jackknifing” in the statistics literature - to the smoothed sample
binding function. Provided that the population binding function is sufficiently smooth, a Taylor
series expansion gives $\theta_l(\beta,\lambda) = \theta_l(\beta,0) + \sum_{r=1}^{s} \alpha_{rl}(\beta)\lambda^r + o(\lambda^s)$ as $\lambda \to 0$, for $l \in \{1,\dots,d_\theta\}$.
Then, for a fixed choice of $\delta \in (0,1)$, we have the first-order extrapolation,

\[ \theta^1_l(\beta,\lambda) := \frac{\theta_l(\beta,\delta\lambda) - \delta\,\theta_l(\beta,\lambda)}{1-\delta} = \theta_l(\beta,0) + \frac{\delta}{1-\delta}\sum_{r=2}^{s} (\delta^{r-1} - 1)\,\alpha_{rl}(\beta)\lambda^r + o(\lambda^s), \]

for every $l \in \{1,\dots,d_\theta\}$. By an iterative process, for $k \le s-1$ we can construct a $k$th order
extrapolation of the binding function, which satisfies

\[ \theta^k(\beta,\lambda) := \sum_{r=0}^{k} \gamma_{rk}\,\theta(\beta,\delta^r\lambda) = \theta(\beta,0) + O(\lambda^{k+1}), \tag{4.3} \]

where the weights $\{\gamma_{rk}\}_{r=0}^{k}$ (which can be negative) satisfy $\sum_{r=0}^{k}\gamma_{rk} = 1$, and may be calculated
using Algorithm 1.3.1 in Sidi (2003). It is immediately apparent that the $k$th order jackknifed
sample binding function,

\[ \bar\theta^k_n(\beta,\lambda_n) := \sum_{r=0}^{k} \gamma_{rk}\,\bar\theta_n(\beta,\delta^r\lambda_n) \tag{4.4} \]

will enjoy an asymptotic bias of order $O_p(\lambda_n^{k+1})$, whence only $n^{1/2}\lambda_n^{k+1} = o_p(1)$ is necessary for
the bias to be asymptotically negligible.
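As a rough illustration of the extrapolation, the sketch below computes a set of weights and applies them to a toy function. It is an assumption-laden sketch, not Sidi's Algorithm 1.3.1: the weights are obtained from the Lagrange interpolation formula at the nodes $\delta^r$, evaluated at zero, which yields weights that sum to one and cancel the first $k$ powers of $\lambda$. The function names and the toy "binding function" are hypothetical.

```python
def jackknife_weights(k, delta):
    """Weights gamma_{rk}, r = 0..k, via Lagrange interpolation at nodes delta**r, evaluated at 0."""
    nodes = [delta ** r for r in range(k + 1)]
    weights = []
    for r, t_r in enumerate(nodes):
        w = 1.0
        for j, t_j in enumerate(nodes):
            if j != r:
                w *= t_j / (t_j - t_r)
        weights.append(w)
    return weights

def extrapolate(theta, lam, k, delta):
    """k-th order extrapolation sum_r gamma_{rk} * theta(delta**r * lam), cf. (4.3)."""
    gammas = jackknife_weights(k, delta)
    return sum(g * theta(delta ** r * lam) for r, g in enumerate(gammas))

def theta(lam):
    """Toy binding function with theta(0) = 2 and a polynomial smoothing bias."""
    return 2.0 + 3.0 * lam - lam ** 2

print(jackknife_weights(1, 0.5))         # first-order weights
print(theta(0.4))                        # biased value at lam = 0.4
print(extrapolate(theta, 0.4, 2, 0.5))   # second-order extrapolation
```

For $k = 1$ the weights reproduce the first-order extrapolation above, $[\theta(\delta\lambda) - \delta\theta(\lambda)]/(1-\delta)$; in an application `theta` would be replaced by the smoothed sample binding function evaluated at the scaled smoothing parameters.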

In the case where $\hat\theta^m_n(\beta,\lambda) = g(T^m_n(\beta,\lambda))$, for some differentiable transformation $g$ of a vector
$T^m_n$ of sufficient statistics (as in Section 5 below), jackknifing could be applied directly to these
statistics. Thus, if we were to set $\hat\theta^{mk}_n(\beta,\lambda_n) := g(\sum_{r=0}^{k}\gamma_{rk}T^m_n(\beta,\delta^r\lambda_n))$, then the resulting sample
binding function would also have an asymptotic bias of order $O_p(\lambda_n^{k+1})$. This approach may have computational
advantages if the transformation $g$ is relatively costly to compute (e.g. when it involves matrix
inversion). Note that since $T^m_n$ will generally involve averages of nonlinear transformations of
kernel smoothers, it will not generally be possible to achieve the same bias reduction through the
use of higher-order kernels; whereas if only linear transformations were involved, both jackknifing
and higher-order kernels would yield identical estimators (see, e.g., Jones and Foster, 1993).

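Jackknifed GII estimators of order $k \in \mathbb{N}_0$ may now be defined as the minimizers of:

\[ Q^e_{nk}(\beta,\lambda_n) := \begin{cases} \|\bar\theta^k_n(\beta,\lambda_n) - \hat\theta_n\|^2_{W_n} & \text{if } e = \mathrm{W} \\ -\mathcal{L}_n(y; x, \bar\theta^k_n(\beta,\lambda_n)) & \text{if } e = \mathrm{LR} \\ \|\dot{\mathcal{L}}^k_n(\beta,\lambda_n;\hat\theta_n)\|^2_{\Sigma_n} & \text{if } e = \mathrm{LM} \end{cases} \tag{4.5} \]

where $\dot{\mathcal{L}}^k_n(\beta,\lambda;\hat\theta_n) := \sum_{r=0}^{k}\gamma_{rk}\dot{\mathcal{L}}_n(y(\beta,\delta^r\lambda), x; \hat\theta_n)$ denotes the jackknifed score function; the
un-jackknifed estimators may be recovered by taking $k = 0$. Let $Q^e_k(\beta,\lambda)$ denote the large-
sample limit of $Q^e_{nk}(\beta,\lambda)$; note that $\beta \mapsto Q^e_k(\beta,0)$ is smooth and does not depend on $k$.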


4.3 Bias reduction via a Newton-Raphson step 

By allowing the number of simulations $M$ to increase with the sample size, we can accelerate the
rate at which $\bar\theta^k_n$ converges to the binding function. The convergence of the smoothed derivatives
of $\bar\theta^k_n$ should then follow under less restrictive conditions on $\lambda_n$. That is, it may be possible for
the derivatives to converge, while still ensuring that the bias is $o(n^{-1/2})$, even with $k = 0$. Since
the evaluation of $Q_{nk}$ is potentially costly when $M$ is very large, one possible approach would
be to minimize $Q_{nk}$ using a very small initial value of $M$ (e.g. $M = 1$). One could then increase
$M$ to an appropriately large value, and then compute a new estimate by taking at least one
Newton-Raphson step (applied to the new criterion).

A rigorous analysis of this estimator is beyond the scope of this paper; we assume that $M$ is
fixed throughout Section 5. Heuristically, since $\bar\theta^k_n$ is computed using $nM$ observations, it should
be possible to show that if $M = M_n \to \infty$, then the conditions specified in Proposition 5.1 below
would remain the same, except with $nM_n$ replacing every appearance of $n$ in (5.7).


4.4 Choosing an auxiliary model 

Efficiency is a key consideration when choosing an auxiliary model. As discussed in Section 3.1, 
indirect inference (generalized or not) has the same asymptotic efficiency as maximum likelihood 
when the auxiliary model is correctly specified in the sense that it provides a correct statistical 
description of the observed data (Gallant and Tauchen, 1996). Thus, from the perspective of 
efficiency, it is important to choose an auxiliary model (or a class of auxiliary models) that is 
flexible enough to provide a good description of the data. 

Another important consideration is computation time. For the Wald and LR approaches 
to indirect inference, the auxiliary parameters must be estimated repeatedly using different 
simulated data sets. For this reason, it is critical to use an auxiliary model that can be estimated 
quickly and efficiently. This consideration is less important for the LM approach, as it does not 
work directly with the estimated auxiliary parameters, but instead uses the first-order conditions 
(the score vector) that define these estimates.
To meet the twin criteria of statistical and computational efficiency, in Section 6 we use linear
probability models (or, more accurately, sets of linear probability models) as the auxiliary model. 
This class of models is flexible in the sense that an individual’s current choice can be allowed to 
depend on polynomial functions of lagged choices and of current and lagged exogenous variables. 
These models can also be very quickly and easily estimated using ordinary least squares. Section 6 
describes in detail how we specify the linear probability models for each of Models 1-4. For 
Model 5, the Heckman selection model, the auxiliary model would be a set of OLS regressions 
with mixed discrete/continuous dependent variables. 
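To make the computational point concrete, the sketch below fits one linear probability model by ordinary least squares, regressing the current choice on a constant, the current covariate and the lagged choice. It is a minimal illustration under assumed names (`ols`, `lpm_design`) and an assumed regressor set; the specifications actually used for Models 1-4 are those described in Section 6.

```python
def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y, by Gaussian elimination."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    c = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for j in range(col, k):
                A[r][j] -= f * A[col][j]
            c[r] -= f * c[col]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b

def lpm_design(x, y):
    """Rows (1, x_t, y_{t-1}) for t = 2..T, paired with the dependent variable y_t."""
    X = [[1.0, x[t], y[t - 1]] for t in range(1, len(y))]
    return X, y[1:]

# Illustrative single path of (possibly smoothed) choices and covariates.
x = [0.5, -1.0, 0.2, 1.5, -0.3, 0.8]
y = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
X, dep = lpm_design(x, y)
print(ols(X, dep))
```

Because the dependent variable enters only through `X'y`, the same routine applies unchanged when the observed choices are replaced by smoothed simulated choices taking values in [0, 1].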

5 Asymptotic and computational properties 

While GII could in principle be applied to any model of the form (2.1) - and others besides - 
in order to keep this paper to a manageable length, the theoretical results of this section will 
require that some further restrictions be placed on the structure of the model. Nonetheless, 
these restrictions are sufficiently weak to be consistent with each of Models 1-5 from Section 2. 
We shall only provide results for the Wald and LR estimators, when these are jackknifed as per 
(4.4) above; but it would be possible to extend our arguments so as to cover the LM estimator, 
and the alternative jackknifing procedure (in which the statistics $T^m_n$ are jackknifed) outlined in
Section 4.2. 

5.1 A general framework 

Individual $i$ is described by vectors $x_i \in \mathbb{R}^{d_x}$ and $\eta_i \in \mathbb{R}^{d_\eta}$ of observable and unobservable
characteristics; $x_i$ includes all the covariates appearing in either the structural model or the
auxiliary model (or both), and $\eta_i$ is a vector of independent variates that are also independent of
$x_i$, and normalized to have unit variance. Their marginal distributions are fully specified by the
model, allowing these to be simulated. Collect $z_i := (x_i^\top, \eta_i^\top)^\top \in \mathbb{R}^{d_z}$ and define the projections
$[x(\cdot), \eta(\cdot)]$ so that $(x_i, \eta_i) = [x(z_i), \eta(z_i)]$. Individual $i$ has a vector $y(z_i;\beta,\lambda) \in \mathbb{R}^{d_y}$ of smoothed
outcomes, parametrized by $(\beta,\lambda) \in \mathrm{B} \times \Lambda$, with $\lambda = 0$ corresponding to true, unsmoothed
outcomes under $\beta$. At this level of abstraction, we need not make any notational distinction
between choices made by an individual at the same date (over competing alternatives), vs.
choices made at distinct dates; we note simply that each corresponds to some element of $y(\cdot)$.
With this notation, the $m$th simulated choices may be written as $y(z_i^m;\beta,\lambda)$; since the same $x_i$'s
are used across all simulations, we have $x(z_i^m) = x(z_i^{m'})$ but $\eta(z_i^m) \ne \eta(z_i^{m'})$ for $m' \ne m$.

In line with the discussion in Section 4.4, we shall assume that the auxiliary model takes the 
form of a system of seemingly unrelated regressions (SUR; see e.g. Section 10.2 in Greene, 2008) 

\[ y_r(z_i;\beta,\lambda) = \alpha_{xr}^\top \Pi_{xr}\, x(z_i) + \alpha_{yr}^\top \Pi_{yr}\, y(z_i;\beta,\lambda) + \xi_{ri}, \tag{5.1} \]

where $\xi_i := (\xi_{1i},\dots,\xi_{d_y i})^\top \sim_{\text{i.i.d.}} N[0,\Sigma_\xi]$, and $\Pi_{xr}$ and $\Pi_{yr}$ are selection matrices (i.e. matrices
that take at most one unit value along each row, and have zeros everywhere else); let $\alpha_r :=
(\alpha_{xr}^\top, \alpha_{yr}^\top)^\top$. Typically, $\Sigma_\xi$ will be assumed block diagonal: for example, we may only allow those $\xi_{ri}$'s
pertaining to alternatives from the same period to be correlated. The auxiliary parameter
vector $\theta$ collects a subset (or possibly all) of the elements of $(\alpha_1^\top,\dots,\alpha_{d_y}^\top)$ and the (unrestricted)
elements of $\Sigma_\xi$. (For the calculations involving the score vector in Appendix C, it shall be more
convenient to treat the model as being parametrized in terms of $\Sigma_\xi^{-1}$.)

Several estimators of $\theta$ are available, most notably OLS, feasible GLS, and maximum likelihood,
all of which agree only under certain conditions.[6] For concreteness, we shall assume that
both the data-based and simulation-based estimates of $\theta$ are produced by maximum likelihood.
However, the results of this paper could be easily extended to cover the case where either (or
both) of these estimates are computed using OLS or feasible GLS. (In those cases, the auxiliary
estimator can still be written as a function of a vector of sufficient statistics, a property that
greatly facilitates the proofs of our results.)

We shall also need to restrict the manner in which $y(\cdot)$ is parametrized. To that end, we
introduce the following collections of linear indices

\[ \nu_r(z;\beta) := z^\top \Pi_{\nu r}\,\gamma(\beta), \qquad r \in \{1,\dots,d_\nu\} \tag{5.2a} \]

\[ \omega_r(z;\beta) := z^\top \Pi_{\omega r}\,\gamma(\beta), \qquad r \in \{1,\dots,d_\omega\} \tag{5.2b} \]

where $\gamma : \mathrm{B} \to \Gamma$, and $\Pi_{\nu r}$ and $\Pi_{\omega r}$ are selection matrices. We shall generally suppress the $z$
argument from $\nu$ and $\omega$, and other quantities constructed from them, throughout the sequel.
Our principal restriction on $y(\cdot)$ is that it should be constructed from $(\nu, \omega)$ as follows. Let
$d_c \ge d_\omega$; for each $r \in \{1,\dots,d_c\}$, let $S_r \subseteq \{1,\dots,d_\nu\}$ and define

\[ \tilde y_r(\beta,\lambda) := \omega_r(\beta) \cdot \prod_{s \in S_r} K_\lambda[\nu_s(\beta)] \tag{5.3} \]

collecting these in the vector $\tilde y(\beta,\lambda)$; where now $K : \mathbb{R} \to [0,1]$ is a smooth univariate cdf, and
$K_\lambda(v) := K(\lambda^{-1}v)$. Note that $d_c \ge d_\omega$, and that we have defined

\[ \omega_r(z;\beta) := 1, \qquad r \in \{d_\omega + 1,\dots,d_c\}. \tag{5.4} \]

Let $\eta_\omega$ select the elements of $\eta$ upon which $\omega$ actually depends (as determined by the
$\Pi_{\omega r}$ matrices), and let $W_r \ge 1$ denote an envelope for $\omega_r$, in the sense that $|\omega_r(z;\beta)| \le W_r(z)$
for all $\beta \in \mathrm{B}$. Let $\varrho_{\min}(A)$ denote the smallest eigenvalue of a symmetric matrix $A$.

Our results rely on the following low-level assumptions on the structural model:

Assumption L (low-level conditions).

L1 $(y_i, x_i)$ is i.i.d. over $i$, and $\eta_i^m = \eta(z_i^m)$ is independent of $x_i$ and i.i.d. over $i$ and $m$;

L2 $y(\beta,\lambda) = D\tilde y(\beta,\lambda)$ for some $D \in \mathbb{R}^{d_y \times d_c}$, for $\tilde y$ as in (5.3);

L3 $\gamma : \mathrm{B} \to \Gamma$ in (5.2) is twice continuously differentiable;

[6] In Section 6, exact numerical agreement between these estimators is ensured by requiring the auxiliary model
equations referring to alternatives from the same period to have the same set of regressors.

[7] Keane and Smith (2003) suggested using the multivariate logistic cdf, $L(v) := 1/(1 + \sum_{j} e^{-v_j})$, and this is
used in the simulation exercises presented in Section 6. But $L$ has no particular advantages over other choices of
$K$, and for the theoretical results we shall in fact assume that the smoothing is implemented using suitable
products of univariate cdfs. This assumption eases some of our arguments (but it is unlikely that it is necessary
for our results).
L4 for each $k \in \{1,\dots,d_\eta\}$, $\operatorname{var}(\eta_{ki}) = 1$, and $\eta_{ki}$ has a density $f_k$ with

sup(l + \u\^)fk{u) < oo; 

uGM. 


L5 there exists an $\epsilon > 0$ such that, for every $r \in \{1,\dots,d_\nu\}$ and $\beta \in \mathrm{B}$,

\[ \operatorname{var}(\nu_r(z_i;\beta) \mid \eta_{\omega i}, x_i) \ge \epsilon; \]


L6 there exists apo > 2 such that for each r G {1,..., dc}, E(PB^+|| 2 ;j||^) < oo, E| < 

oo and E| < co; 

L7 inf(^i 3 ^x)&BxA Qmm[^y{zi; P, X)y{zi; 13, X)'''] > 0, where y{l3,\) := [y(/3, A)’’’, x’’’]"'’; and 

L8 the auxiliary model is a Gaussian SUR, as in (5.1). 

Remark 5.1. (5.2) entails that the estimator criterion function $Q_n$ depends on $\beta$ only through
$\gamma(\beta)$, i.e. $Q_n(\beta) = \tilde Q_n(\gamma(\beta))$ for some $\tilde Q_n$. Since the derivatives of $\tilde Q_n$ with respect to $\gamma$ take a
reasonably simple form, we shall establish the convergence of $\partial^l_\beta Q_n$ to $\partial^l_\beta Q$, for $l \in \{1,2\}$, by first
proving the corresponding result for $\partial^l_\gamma \tilde Q_n$ and then applying the chain rule. Here, as elsewhere
in the paper, $\partial_\beta f$ denotes the gradient of $f : \mathrm{B} \to \mathbb{R}^k$ (the transpose of the Jacobian), and $\partial^2_\beta f$
the Hessian; see Section 6.3 of Magnus and Neudecker, 2007, for a definition of the latter when
$k \ge 2$.

Remark 5.2. Assumption L is least restrictive in models with purely discrete outcomes, for which 
we may take d^ = 0. In particular, L6 reduces to the requirement that E|| 2 ;j|p^’° < 00 . 

Remark 5.3. As the examples discussed in Section 5.2 illustrate, except in the case where current
(discrete) choices depend on past choices, it is generally possible to take $D = I_{d_y}$ in L2, so that
$\tilde y(\beta,\lambda) = y(\beta,\lambda)$.

Consistent with the notation adopted in the previous sections of this paper, let $\eta_i^m$ denote
the $m$th set of simulated unobservables, and $y_i^m(\beta,\lambda)$ the associated smoothed outcomes, for
$m \in \{1,\dots,M\}$. We may set $\Lambda = [0,1]$ below without loss of generality. Let $\mathcal{F}$ denote a $\sigma$-field
with respect to which the observed data and simulated variables are measurable, for all $n$, and
recall the definition of ^ given in (3.1) above. We then have the following: 

Assumption R (regularity conditions). 

R1 The structural model is correctly specified: $y_i = y(z_i^0;\beta_0,0)$ for some $\beta_0 \in \operatorname{int}\mathrm{B}$;

R2 $\theta_0 := \theta(\beta_0,0) \in \operatorname{int}\Theta$;

R3 the binding function $\theta(\beta,\lambda)$ is single-valued, and is $(k_0+1)$-times differentiable in $\beta$ for all
$(\beta,\lambda) \in (\operatorname{int}\mathrm{B}) \times \Lambda$;

R4 $\beta \mapsto \theta(\beta,0)$ is injective;

R5 $\{\lambda_n\}$ is an $\mathcal{F}$-measurable sequence with $\lambda_n \overset{p}{\to} 0$;

R6 the order $k \in \{1,\dots,k_0\}$ of the jackknifing is chosen such that $n^{1/2}\lambda_n^{k+1} = o_p(1)$;

R7 $K$ in (5.3) is a twice continuously differentiable cdf, for a distribution having integer moments
of all orders, and density $\dot K$ symmetric about the origin; and

R8 $W_n \overset{p}{\to} W$, for some positive definite $W$.

Remark 5.4. R4 formalizes the requirement that the auxiliary model be “sufficiently rich” to
identify the parameters of the structural model; $d_\theta \ge d_\beta$ evidently is necessary for R4 to be
satisfied.

Remark 5.5. R5 permits the bandwidth to be sample-dependent, as distinct from assuming it 
to be a “given” deterministic sequence. This means our results hold uniformly in smoothing 
parameter sequences satisfying certain growth rate conditions: see Remark 5.7 below for details. 
R6 ensures that, in conjunction with the choice of $\lambda_n$, the jackknifing is such as to ensure that
the bias introduced by the smoothing is asymptotically negligible. R7 will be satisfied for many 
standard choices of K, such as the Gaussian cdf, and many smooth, compactly supported kernels. 

Assumptions L and R are sufficient for all of our main results. But to allow these to be
stated at a higher level of generality - and thus permitting their application to a broader class of
structural and auxiliary models than are consistent with Assumption L - we shall find it useful
to phrase our results as holding under Assumption R and the following high-level conditions. To
state these, define $\mathcal{L}_n(\theta) := \mathcal{L}_n(y,x;\theta)$, $\mathcal{L}(\theta) := E\mathcal{L}_n(\theta)$ and $\dot\ell_i^m(\beta,\lambda;\theta) := \dot\ell(y_i^m(\beta,\lambda),x_i;\theta)$. $\dot\ell_i^m$
and $\ddot{\mathcal{L}}_n$ respectively denote the gradient of $\ell_i^m$ and the Hessian of $\mathcal{L}_n$ with respect to $\theta$, while for
a metric space $(Q,d)$, $\ell^\infty(Q)$ denotes the space of bounded and measurable real-valued functions
on $Q$.

Assumption H (high-level conditions). 

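H1 $\mathcal{L}_n$ is twice continuously differentiable on $\operatorname{int}\Theta$;

H2 for $l \in \{0,1,2\}$, $\partial^l_\theta \mathcal{L}_n(\theta) \overset{p}{\to} \partial^l_\theta \mathcal{L}(\theta)$ and

\[ \frac{1}{n}\sum_{i=1}^{n} \dot\ell_i^{m_1}(\beta_1,\lambda_1;\theta_1)\,\dot\ell_i^{m_2}(\beta_2,\lambda_2;\theta_2)^\top \overset{p}{\to} E\,\dot\ell_i^{m_1}(\beta_1,\lambda_1;\theta_1)\,\dot\ell_i^{m_2}(\beta_2,\lambda_2;\theta_2)^\top \]

uniformly on $\mathrm{B} \times \Lambda$ and compact subsets of $\operatorname{int}\Theta$, for every $m_1, m_2 \in \{0,1,\dots,M\}$;

H3 $\psi^m$ is a mean-zero, continuous Gaussian process on $\mathrm{B} \times \Lambda$ such that

\[ \psi_n^m(\beta,\lambda) := n^{1/2}[\hat\theta_n^m(\beta,\lambda) - \theta(\beta,\lambda)] \rightsquigarrow \psi^m(\beta,\lambda) \]

in $\ell^\infty(\mathrm{B} \times \Lambda)$, jointly in $m \in \{0,1,\dots,M\}$;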

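H4 for any (possibly) random sequence $\beta_n = \beta_0 + o_p(1)$ and $\lambda_n$ as in R5,

\[ \psi_n^m(\beta_n,\lambda_n) = -H^{-1}\frac{1}{n^{1/2}}\sum_{i=1}^{n} \dot\ell_i^m(\beta_0,0;\theta_0) + o_p(1) =: -H^{-1}\phi_n^m + o_p(1) \rightsquigarrow -H^{-1}\phi^m \tag{5.5} \]

jointly in $m \in \{0,1,\dots,M\}$,[8] where $H := E\ddot{\mathcal{L}}_n(\theta_0) = \ddot{\mathcal{L}}(\theta_0)$ and $\phi^m$ is jointly Gaussian
with

\[ \Sigma := E\,\phi^{m_1}\phi^{m_1\top} = E\,\dot\ell_i^{m_1}\dot\ell_i^{m_1\top}, \qquad R := E\,\phi^{m_1}\phi^{m_2\top} = E\,\dot\ell_i^{m_1}\dot\ell_i^{m_2\top} \tag{5.6} \]

for every $m_1, m_2 \in \{0,1,\dots,M\}$ with $m_1 \ne m_2$; and

H5 $\{\lambda_n\}$ is such that for some $l \in \{0,1,2\}$,

\[ \sup_{\beta \in \mathrm{B}} \|\partial^l_\beta \bar\theta^k_n(\beta,\lambda_n) - \partial^l_\beta \theta(\beta,0)\| = o_p(1). \]

[8] The first equality in (5.5) is only relevant for $m \ge 1$.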

/3eB 

The sufficiency of our low-level conditions for the preceding may be stated formally as follows.

Proposition 5.1. Suppose Assumptions L and R hold. Then Assumption H holds with $l = 0$ in
H5. Further, if $\lambda_n > 0$ for all $n$, with

\[ n^{1-1/p_0}\lambda_n^{2l'-1} / \log(\lambda_n^{-1} \vee n) \overset{p}{\to} \infty \tag{5.7} \]

for some $l' \in \{1,2\}$, then H5 holds with $l = l'$.

Remark 5.6. It is evident from (5.7) that - as noted in Section 4.2 above - the convergence of
the higher-order derivatives of the sample binding function requires more stringent conditions on
the smoothing sequence $\{\lambda_n\}$. We shall be accordingly careful, in stating our results below, to
identify the weakest form of H5 (and correspondingly, of (5.7)) that is required for each of these.

Remark 5.7. Let $\{\underline\lambda_n\}$ and $\{\bar\lambda_n\}$ be deterministic sequences satisfying (5.7) and $\bar\lambda_n = o(1)$
respectively, and set $\Lambda_n := [\underline\lambda_n, \bar\lambda_n]$. Then, as indicated in Remark 5.5 above, $\mathcal{F}$-measurability of
$\{\lambda_n\}$ entails that the convergence in H5 holds uniformly over $\lambda \in \Lambda_n$, in the sense that

\[ \sup_{(\beta,\lambda) \in \mathrm{B} \times \Lambda_n} \|\partial^l_\beta \bar\theta^k_n(\beta,\lambda) - \partial^l_\beta \theta(\beta,\lambda)\| = o_p(1). \]

A similar interpretation applies to Theorems 5.3-5.5 below.

Proposition 5.1 is proved in Appendix C.

5.2 Application to examples 

We may verify that each of the models from Section 2 satisfies L2-L6. In all cases, $x_i$ collects
all the (unique) elements of $\{x_{it}\}_{t=1}^{T}$, together with any additional exogenous covariates used to
estimate the auxiliary model; while $\eta_i$ collects the elements of $\{\eta_{it}\}_{t=1}^{T}$. Note that for the discrete
choice Models 1-4, since the ry are Gaussian L6 will be satisfied if E||xi|p^° < 00 . L7 is a standard 
non-degeneracy condition. 

Model 1. $u_{it} = b x_{it} + \sum_{s=1}^{t} r^{t-s}\eta_{is}$ by backward substitution. So we set $(d_\nu, d_\omega) = (T, 0)$, with

\[ \nu_t(z_i;\beta) = x_t(z_i)\,b(\beta) + \sum_{s=1}^{t} \eta_s(z_i)\,d_{ts}(\beta), \]

where $\beta = (b, r)$, $b(\beta) = b$ and $d_{ts}(\beta) = r^{t-s}$, while $x_t(z_i)$ and $\eta_s(z_i)$ select the appropriate
elements of $z_i$, which collects $\{x_{it}\}$, $\{\eta_{it}\}$, and any other exogenous covariates used in the auxiliary
model. Thus L2 and L3 hold (formally, take $\gamma(\beta) = (b(\beta), \{d_{ts}(\beta)\})$). L5 follows from the $\eta_t(z_i)$'s
being standard Gaussian.

Model 2. As per the discussion in Section 4.1, and (4.2) in particular, we define

\[ \nu_{tk}(z_i;\beta) := x_t(z_i)\,b_1(\beta) + b_2(\beta)\,1\{k = 1\} + \sum_{s=1}^{t} \eta_s(z_i)\,d_{ts}(\beta) \]

where the right-hand side quantities are defined by analogy with the preceding example. Setting

\[ y_t(\beta,\lambda) := K_\lambda[\nu_{t0}(\beta)] \cdot [1 - y_{t-1}(\beta,\lambda)] + K_\lambda[\nu_{t1}(\beta)] \cdot y_{t-1}(\beta,\lambda) \tag{5.8} \]

with $y_0(\beta,\lambda) := 0$ thus yields smoothed choices having the form required by L2 and L3, as may
be easily verified by backwards substitution. L5 again follows from Gaussianity of $\eta_t(z_i)$.

An identical recursion to (5.8) also works for Model 3. Model 4 may be handled in a similar 
way to Model 1, but it is in certain respects simpler, because the errors are not serially dependent. 
Finally, it remains to consider: 

Model 5. From the preceding examples, it is clear that $\omega(z_i;\beta) = w_i$ and $\nu(z_i;\beta) = u_i$ can be
written in the linear index form (5.2). The observable outcomes are the individual’s decision to
work, and also his wage if he decides to work. These may be smoothly approximated by:

\[ y_1(\beta,\lambda) := K_\lambda[\nu(\beta)], \qquad y_2(\beta,\lambda) := \omega(\beta) \cdot K_\lambda[\nu(\beta)], \]

respectively. Thus L2-L5 hold just as in the other models. L6 holds, in this case, if E|| 2 :j||^P° < oo. 
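The smoothed outcomes for Model 5 are simple to compute once the two indices are in hand. The sketch below is illustrative only: it assumes a logistic kernel, and the function name and inputs (`w` for the wage index, `u` for the participation index) are hypothetical labels for the quantities $\omega(\beta)$ and $\nu(\beta)$ above.

```python
import math

def logistic_cdf(v):
    """Numerically stable logistic cdf (assumed kernel)."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def smoothed_selection_outcomes(w, u, lam):
    """Return (y1, y2) = (K_lam[u], w * K_lam[u]): smoothed work decision and smoothed observed wage."""
    k = logistic_cdf(u / lam)
    return k, w * k

# As lam shrinks, y1 approaches the work indicator and y2 approaches
# the wage for workers and zero for non-workers.
for lam in (0.5, 0.1, 0.01):
    print(lam, smoothed_selection_outcomes(2.5, 0.4, lam), smoothed_selection_outcomes(2.5, -0.4, lam))
```

Both outcomes are products of a linear index (or the constant one) with a kernel term, which is the structure required by (5.3).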

5.3 Limiting distributions of GII estimators 

We now present our asymptotic results. Note that Assumptions R and H are maintained throughout
the following (even if not explicitly referenced), though in accordance with Remark 5.6 above,
we shall always explicitly state the order of $l$ in H5 that is required for each of our theorems.

Our first result concerns the limiting distributions of the minimizers of the Wald and LR
criterion functions, as displayed in (4.5) above. For $e \in \{\mathrm{W}, \mathrm{LR}\}$, let $\hat\beta^e_{nk}$ be a near-minimizer of
$Q^e_{nk}$, in the sense that

\[ Q^e_{nk}(\hat\beta^e_{nk},\lambda_n) \le \inf_{\beta \in \mathrm{B}} Q^e_{nk}(\beta,\lambda_n) + o_p(n^{-1}). \tag{5.9} \]

The limiting variance of both estimators will have the familiar sandwich form. To allow the next
result to be stated succinctly, define

\[ \Omega(U,V) := (G^\top U G)^{-1} G^\top U H^{-1} V H^{-1} U G (G^\top U G)^{-1} \tag{5.10} \]

where $G := [\partial_\beta \theta(\beta_0,0)]^\top$ denotes the Jacobian of the binding function at $(\beta_0,0)$, $H = E\ddot{\mathcal{L}}_n(\theta_0)$,
and $U$ and $V$ are symmetric matrices.

Theorem 5.1 (limiting distributions). Suppose H5 holds with $l = 0$. Then

\[ n^{1/2}(\hat\beta^e_{nk} - \beta_0) \rightsquigarrow N[0, \Omega(U_e, V_e)], \]

where

\[ U_e := \begin{cases} W & \text{if } e = \mathrm{W} \\ H & \text{if } e = \mathrm{LR} \end{cases} \]



(5.11) 


Remark 5.8. In view of Proposition 5.1 and the remark that follows it, Theorem 5.1 does not
restrict the rate at which $\lambda_n \to 0$ from below; indeed, it continues to hold even if $\lambda_n = 0$ for all $n$,
in which case the estimation problem is closely related to that considered by Pakes and Pollard
(1989). Thus, while the theorem provides the “desired” limiting distribution for our estimators,
it fails to provide a justification (or motivation) for the smoothing proposed in this paper, and
is in this sense unsatisfactory (or incomplete).

Remark 5.9. Note that the order of jackknifing does not affect the limiting distribution of the
estimator: this has only a second-order effect, which vanishes as $\lambda_n \to 0$.

Remark 5.10. It is possible to define the LR estimator as the minimizer of

\[ Q^{\mathrm{LR}}_{nk}(\beta) := -\tilde{\mathcal{L}}_n(y; x, \bar\theta^k_n(\beta,\lambda_n)), \]

where the average log-likelihood $\tilde{\mathcal{L}}_n$ need not correspond to that maximized by $\hat\theta^m_n$, provided
that the maximizers of both $\tilde{\mathcal{L}}_n$ and $\mathcal{L}_n$ are consistent for the same parameters. For example,
$\hat\theta^m_n$ might be OLS (and the associated residual covariance estimators), whereas $\tilde{\mathcal{L}}_n$ is the average
log-likelihood for a SUR model. Suppose that the maximizer $\tilde\theta_n$ of $\tilde{\mathcal{L}}_n$ satisfies the following
analogue of (5.5),

1 ” • 

n^/^iOn - eo) = 0 ; ^ 


and define S := , and R := for m > 1. Then the conclusions of Theorem 5.1 

continue to hold, except that the appearing in (5.10) must be replaced by 




-F R-Wh-^) + R-^ 





R-^ 


(5.12) 


and 17 lr = R, where R ;= "KCniy■, x] 9 q). Regarding the estimation of these quantities, see 
Remark 5.11 below. (Note that (5.12) reduces to R~^VR~^ when H = R, T, = C and R = R.) 

The proofs of Theorem 5.1 and all other theorems in this paper are given in Appendix B. 


5.4 Convergence of smoothed derivatives and variance estimators 

Theorem 5.1 fails to indicate the possible benefits of smoothing, because it simply posits the
existence of a near-minimizer of $Q_{nk}$, and thus entirely ignores how such a minimizer might
be computed in practice. Ideally, smoothing should be shown to facilitate the convergence of
derivative-based optimization procedures, when these are applied to the problem of minimizing
$Q_{nk}$, while still yielding an estimator having the same limit distribution as in Theorem 5.1.

For the analysis of these procedures, the large-sample behavior of the derivatives of $Q_{nk}$
will naturally play an important role.[9] The uniform convergence of the derivatives of the sample
binding function - and hence those of $Q_{nk}$ - follows immediately from H5, and sufficient conditions
for this convergence are provided by Proposition 5.1 above. Notably, when $l' \in \{1,2\}$, (5.7)
imposes exactly the sort of lower bound on $\lambda_n$ that is absent from Theorem 5.1.

H5, with $l = 1$, implies that the derivatives of the smoothed criterion function can be used
to estimate the Jacobian matrix $G$ that appears in the limiting variances in Theorem 5.1. The
remaining components, $H$ and $V$, can be respectively estimated using the data-based auxiliary
log-likelihood Hessian, and an appropriate transformation of the joint sample variance of all the
auxiliary log-likelihood scores (i.e. using both the data- and simulation-based models). Define



where 6'^ := and i%0) denotes the gradient of £{yi ,Xi]6). Then we have 

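Theorem 5.2 (variance estimation). Suppose H5 holds with $l = 0$. Then

(i) $\hat H_n := \ddot{\mathcal{L}}_n(\hat\theta_n) \overset{p}{\to} H$;

(ii) $\hat V_n := \frac{1}{n}\sum_{i=1}^{n} s_{ni}s_{ni}^\top \overset{p}{\to} V$; and

if H5 holds with $l = 1$, then

(iii) $\hat G_n := [\partial_\beta \bar\theta^k_n(\hat\beta^e_{nk},\lambda_n)]^\top \overset{p}{\to} G$, for $e \in \{\mathrm{W}, \mathrm{LR}\}$.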

Remark 5.11. For the situation envisaged in Remark 5.10, so long as the auxiliary model corresponding
to $\tilde{\mathcal{L}}_n$ satisfies Assumptions R and H, a consistent estimate of (5.12) can be produced
in the manner of (ii) above, if we replace $s_{ni}$ by



where Hn ■= Cn{0n) is consistent for T/lr. 

5.5 Performance of derivative-based optimization procedures 

The potential gains from smoothing may be assessed by comparing the performance of derivative- 
based optimization procedures, as they are applied to each of the following: 

P1 the smoothed sample problem, of minimizing $\beta \mapsto Q_{nk}(\beta,\lambda_n)$; and

P2 its population counterpart, of minimizing $\beta \mapsto Q_k(\beta,0)$.

Since $Q_k$ is automatically smooth (even when $\lambda = 0$, owing to the smoothing effected by the
expectation operator), derivative-based methods ought to be particularly suited to solving P2,
and we may regard their performance when applied to this problem as representing an upper
bound for their performance when applied to P1.

[9] Here, as throughout the remainder of this paper, we are concerned exclusively with the limiting behavior of
the exact derivatives of $Q_{nk}$, ignoring any errors that might be introduced by numerical differentiation.
In the following section, we shall discuss in detail the convergence properties of three popular
optimization routines: Gauss-Newton; quasi-Newton with BFGS updating; and a trust-region
method. But before coming to these, we first provide a result that is of relevance to a broader
range of derivative-based optimization procedures. Since such procedures will typically be designed
to terminate at (near) roots of the first-order conditions,

\[ \partial_\beta Q^e_{nk}(\beta,\lambda_n) = 0 \ \text{ in P1}, \qquad \partial_\beta Q^e_k(\beta,0) = 0 \ \text{ in P2} \]

for $e \in \{\mathrm{W}, \mathrm{LR}\}$, we shall provide conditions on $\lambda_n$, under which, for some $c_n = o_p(1)$,

(i) the set $R^e_{nk} := \{\beta \in \mathrm{B} \mid \|\partial_\beta Q^e_{nk}(\beta,\lambda_n)\| \le c_n\}$ of near roots is “consistent” for subsets of
$R^e := \{\beta \in \mathrm{B} \mid \partial_\beta Q^e(\beta,0) = 0\}$; and

(ii) if $c_n = o_p(n^{-1/2})$, then any $\beta_n \in R^e_{nk}$ with $\beta_n \overset{p}{\to} \beta_0$ has the limiting distribution given by
Theorem 5.1.

We interpret (i) as saying that smoothing yields a sample problem P1 that is “no more difficult”
than the population problem P2, in the sense that the set of points to which derivative-based
optimizers may converge in P1 approximates its counterpart in P2, as $n \to \infty$. This is the
strongest consistency result we can hope to prove here: as $Q$ may have multiple stationary points,
only one of which coincides with its (assumed interior) global minimum, it cannot generally be
true that the whole of $R^e_{nk}$ will be consistent for $\beta_0$. On the other hand, if we can select a
consistent sequence of (near) roots from $R^e_{nk}$, as in (ii), then we may reasonably hope that this
estimator sequence will enjoy the same limiting distribution as a (near) minimizer of $Q_{nk}$.

For $A, B \subseteq \mathrm{B}$, let $d_L(A,B) := \sup_{a \in A} d(a,B)$ denote the one-sided distance from $A$ to $B$,
which has the property that $d_L(A,B) = 0$ if and only if $A \subseteq B$. Recall the definition of $\hat\beta^e_{nk}$
given in (5.9) above. Properties (i) and (ii) above can be more formally expressed as follows.

Theorem 5.3 (near roots). Suppose H5 holds with $l = 1$. Then

(i) $R^e_{nk}$ is nonempty w.p.a.1., and $d_L(R^e_{nk}, R^e) \overset{p}{\to} 0$;

(ii) if $c_n = o_p(n^{-1/2})$, $\beta_n \in R^e_{nk}$ and $\beta_n \overset{p}{\to} \beta_0$, then $n^{1/2}(\beta_n - \hat\beta^e_{nk}) = o_p(1)$, and so $\beta_n$ has the
limiting distribution given by Theorem 5.1; and

(iii) any $\beta_n \in R^e_{nk}$ satisfying $Q^e_{nk}(\beta_n) \le Q^e_{nk}(\hat\beta^e_{nk}) + o_p(1)$ has $\beta_n \overset{p}{\to} \beta_0$.

Remark 5.12. Of course, the requirement that $\beta_n \overset{p}{\to} \beta_0$ cannot be verified in practice; but one
may hope to satisfy it by running the optimization routine from $L$ different starting points located
throughout $\mathrm{B}$, obtaining a collection of terminal values $\{\beta_{nl}\}_{l=1}^{L}$, and then setting $\beta_n = \beta_{nl^*}$ such
that $Q_{nk}(\beta_{nl^*}) \le Q_{nk}(\beta_{nl})$ for all $l \in \{1,\dots,L\}$.

Some optimization routines, such as the trust-region method considered in the next section,
may only be allowed to terminate when the second-order conditions for a minimum are also
satisfied. Defining

\[ S^e_{nk} := \{\beta \in R^e_{nk} \mid \varrho_{\min}[\partial^2_\beta Q^e_{nk}(\beta,\lambda_n)] \ge 0\}, \qquad S^e := \{\beta \in R^e \mid \varrho_{\min}[\partial^2_\beta Q^e(\beta,0)] \ge 0\}, \tag{5.13} \]

we have the following

Theorem 5.4 (near roots satisfying second-order conditions). Suppose H5 holds with $l = 2$.
Then parts (i) and (ii) of Theorem 5.3 hold with $S^e_{nk}$ and $S^e$ in place of $R^e_{nk}$ and $R^e$ respectively.

Remark 5.13. The utility of this result may be seen by considering a case in which $Q$ has
many stationary points, but only a single local minimum at $\beta_0$. Then while Theorem 5.3 only
guarantees convergence to one of these stationary points, Theorem 5.4 ensures consistency for $\beta_0$
- at a cost of requiring that the routine also check the second-order conditions for a minimum.
This is why stronger conditions must be imposed on $\lambda_n$ in Theorem 5.4; we now need the second
derivatives of $Q_{nk}$ to provide reliable information about the curvature of $Q_k$ in large samples.

5.6 Convergence results for specific procedures 

Our final result concerns the question of whether certain optimization routines, if initialized
from within an appropriate region of the parameter space and iterated to convergence, will
yield the minimizer of $Q_{nk}$, and thus an estimator having the limiting distribution displayed in
Theorem 5.1. In some respects, our work here is related to previous work on k-step estimators,
which studies the limiting behavior of estimators computed as the outcome of a sequence of
quasi-Newton iterations (see e.g. Robinson, 1988). However, we shall depart from that literature
in an important respect, by not requiring that our optimization routines be initialized by a
sequence of starting values $\beta_n^{(0)}$ that are assumed consistent for $\beta_0$ (often at some rate). Rather,
we shall require only that $\beta_n^{(0)} \in \mathrm{B}_0 \subset \mathrm{B}$ for a fixed region $\mathrm{B}_0$ satisfying the conditions noted
below.

We consider two popular line-search optimization methods, Gauss-Newton and quasi-Newton
with BFGS updating, as well as a trust-region algorithm. When applied to the problem
of minimizing an objective $Q$, each of these routines proceeds as follows: given an iterate
$\beta^{(s)}$, locally approximate $Q$ by the following quadratic model,

$$f_{(s)}(\beta) := Q(\beta^{(s)}) + \nabla_{(s)}^{\top}(\beta - \beta^{(s)}) + \tfrac{1}{2}(\beta - \beta^{(s)})^{\top} A_{(s)} (\beta - \beta^{(s)}), \qquad (5.14)$$

where $\nabla_{(s)} := \partial_\beta Q(\beta^{(s)})$. A new iterate $\beta^{(s+1)}$ is then generated by approximately minimizing
$f_{(s)}$ with respect to $\beta$. The main differences between these procedures concern the choice of
approximate Hessian $A_{(s)}$ and the manner in which $f_{(s)}$ is (approximately) minimized. A complete
specification of each of the methods considered here is provided in Appendix A (see also
Fletcher, 1987, and Nocedal and Wright, 2006); note that the Gauss-Newton method can only
be applied to the Wald criterion function, since only this criterion has the least-squares form
required by that method.
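To make the role of the quadratic model (5.14) concrete, the following minimal Python sketch builds the model for a scalar parameter and computes its minimizer, which is the Newton-type step when the approximate Hessian is positive. The function names `quadratic_model` and `model_minimizer` and the test criterion are hypothetical illustrations, not part of the paper's implementation.

```python
def quadratic_model(q_s, grad_s, hess_s, beta_s):
    """Return f_(s), the local quadratic model of Q around the iterate beta_s."""
    def f(beta):
        d = beta - beta_s
        return q_s + grad_s * d + 0.5 * hess_s * d * d
    return f


def model_minimizer(grad_s, hess_s, beta_s):
    """Minimizer of the quadratic model; requires a positive approximate Hessian."""
    if hess_s <= 0:
        raise ValueError("approximate Hessian must be positive")
    return beta_s - grad_s / hess_s


# Hypothetical criterion Q(beta) = (beta - 2)^2 + 1, expanded at beta_s = 0.
Q = lambda b: (b - 2.0) ** 2 + 1.0
beta_s = 0.0
grad_s = 2.0 * (beta_s - 2.0)   # first derivative of Q at beta_s
hess_s = 2.0                    # exact second derivative of Q
f_s = quadratic_model(Q(beta_s), grad_s, hess_s, beta_s)
beta_next = model_minimizer(grad_s, hess_s, beta_s)
```

Because this illustrative criterion is itself quadratic, the model coincides with it and a single step lands on the minimizer; for a general criterion the step only improves the iterate locally, which is why the line-search and trust-region safeguards are needed.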

We shall impose the following conditions on the population criterion $Q$, which are sufficient
to ensure that each of these procedures, once started from some $\beta^{(0)} \in B_0$, will converge to the
global minimizer of $Q$. As noted above, since $Q$ may have many other stationary points, $B_0$
must be chosen so as to exclude these (except when the trust region method is used); hence
our convergence results are of an essentially local character. (Were we to relax this condition
on $B_0$, then the arguments yielding Theorem 5.5 below could be modified to establish that
these procedures always converge to some stationary point of $Q$.) To state our conditions, let
$\sigma_{\min}(D)$ denote the smallest singular value of a (possibly non-square) matrix $D$,
and recall $G(\beta) = [\partial_\beta \theta(\beta, 0)]^{\top}$, the Jacobian of the binding function.

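Assumption O (optimization routines). Let $Q \in \{Q^{W}, Q^{LR}\}$. Then $B_0 = B_0(Q)$ may be chosen
as any compact subset of $\operatorname{int} \mathrm{B}$ for which $\beta_0 \in \operatorname{int} B_0$ and
$B_0 = \{\beta \in \mathrm{B} \mid Q(\beta) \le Q(\beta_1)\}$ for some $\beta_1 \in \mathrm{B}$; and either

GN $\|G(\beta)^{\top} W g(\beta)\| \neq 0$ for all $\beta \in B_0 \setminus \{\beta_0\}$ and $\inf_{\beta \in B_0} \sigma_{\min}[G(\beta)] > 0$;

QN $Q$ is strictly convex on $B_0$; or

TR for every $\beta \in B_0 \setminus \{\beta_0\}$, $\|\partial_\beta Q(\beta)\| = 0$ implies $\varrho_{\min}[\partial_\beta^2 Q(\beta)] < 0$.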

Remark 5.14. Note that $\|G(\beta)^{\top} W g(\beta)\| \neq 0$ is equivalent to $\|\partial_\beta Q^{W}(\beta)\| \neq 0$. Both GN and
QN thus imply that $Q$ has no stationary points in $B_0$, other than that which corresponds to the
minimum at $\beta_0$. TR, on the other hand, permits such points to exist, provided that they are
not local minima. In this respect, it places the weakest conditions on $Q$, and does so because
the trust-region method utilizes second-derivative information in a manner that the other two
methods do not.

Before analyzing the convergence properties of these optimization routines, we must first
specify the conditions governing their termination. Let $\{\beta^{(s)}\}$ denote the sequence of iterates
generated by a given routine $r$, from some starting point $\beta^{(0)}$. When $r \in \{GN, QN\}$, we shall
allow the optimization to terminate at the first $s$, denoted $s^*$, for which a near root is located,
in the sense that $\|\partial_\beta Q^{e}_{nk}(\beta^{(s)})\| \le c_n$, where $c_n = o_p(n^{-1/2})$. That is, $s^*$ is the smallest $s$ for
which $\beta^{(s)} \in R^{e}_{nk}$. This motivates the definition, for $r \in \{GN, QN\}$, of

$$\bar\beta^{e}_{nk}(\beta^{(0)}, r) := \begin{cases} \beta^{(s^*)} & \text{if } \beta^{(s)} \in R^{e}_{nk} \text{ for some } s \in \mathbb{N}, \\ \beta^{(0)} & \text{otherwise,} \end{cases} \qquad (5.15)$$

which describes the terminal value of the optimization routine, with the convention that this is
set to $\beta^{(0)}$ if a near root is never located. In the case that $r = TR$, we shall allow the routine
to terminate only at those near roots at which the second-order sufficient conditions for a local
minimum are also satisfied. In this way, $s^*$ now becomes the smallest $s$ for which $\beta^{(s)} \in S^{e}_{nk}$,
and $\bar\beta^{e}_{nk}(\beta^{(0)}, TR)$ may be defined exactly as in (5.15), except with $S^{e}_{nk}$ in place of $R^{e}_{nk}$.
For the purposes of the next result, let $\hat\beta^{e}_{nk}$ denote the exact minimizer of $Q^{e}_{nk}$.
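The termination rule in (5.15) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `terminal_value`, the cap `max_iter` (the definition itself ranges over all iterations), the fixed-step gradient routine, and the tolerance value are all hypothetical choices made here.

```python
def terminal_value(step, grad_norm, beta0, c_n, max_iter=1000):
    """Return the first iterate that is a near root, i.e. has grad_norm <= c_n.

    Falls back to the starting value beta0 when no near root is located
    within max_iter iterations, mirroring the convention in (5.15).
    """
    beta = beta0
    for _ in range(max_iter + 1):
        if grad_norm(beta) <= c_n:
            return beta
        beta = step(beta)
    return beta0


# Hypothetical example: Q(beta) = (beta - 2)^2 with a fixed-step gradient routine.
grad = lambda b: 2.0 * (b - 2.0)
step = lambda b: b - 0.25 * grad(b)
grad_norm = lambda b: abs(grad(b))

beta_bar = terminal_value(step, grad_norm, beta0=0.0, c_n=1e-3)
never_found = terminal_value(step, grad_norm, beta0=0.0, c_n=1e-3, max_iter=3)
```

In the second call the iteration budget is too small for a near root to be reached, so the routine returns the starting value, which is the fallback branch of the definition.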

Theorem 5.5 (derivative-based optimizers). Suppose $r \in \{GN, QN, TR\}$ and $e \in \{W, LR\}$, and
that the corresponding part of Assumption O holds for some $B_0$. Then

$$\sup_{\beta^{(0)} \in B_0} \|\bar\beta^{e}_{nk}(\beta^{(0)}, r) - \hat\beta^{e}_{nk}\| = o_p(n^{-1/2})$$

holds if either

(i) $(r, e) = (GN, W)$ and H5 holds with $l = 1$; or

(ii) $r \in \{QN, TR\}$ and H5 holds with $l = 2$.

Footnote. It may be asked why we do not also propose checking the second-order conditions upon termination when
$r \in \{GN, QN\}$. Such a modification is certainly possible, but is perhaps of doubtful utility. Consider the problem
of minimizing some (deterministic) criterion function that has multiple roots, only one of which corresponds to a
local (and also global) minimum, a scenario envisaged in TR. In this case, the best we can hope to prove is that
the Gauss-Newton and quasi-Newton routines will have some of those roots as points of accumulation, but they
might never enter the vicinity of the local minimum (see Theorems 6.5 and 10.1 in Nocedal and Wright, 2006).
On the other hand, the trust-region algorithm considered here is guaranteed to have the local minimum as a point
of accumulation, under certain conditions (see Moré and Sorensen, 1983, Theorem 4.13).

Remark 5.15. Convergence of the Gauss-Newton method requires the weakest conditions on $\lambda_n$ of
all three algorithms. This is because the Hessian approximation
$A_{n(s)} := G_n(\beta^{(s)})^{\top} W_n G_n(\beta^{(s)})$
used by Gauss-Newton is valid for criteria having the same minimum-distance structure as $Q^{W}_{nk}$;
here $G_n(\beta) := \partial_\beta \bar\theta^{k}_{n}(\beta, \lambda_n)$. Thus the uniform convergence of $G_n$ is sufficient to ensure that
$A_{n(s)}$ behaves suitably in large samples, whence only H5 with $l = 1$ is required.

6 Monte Carlo results 

This section conducts a set of Monte Carlo experiments to assess the performance of the GII 
estimator, in terms of bias, efficiency, and computation time. The parameters of Models 1-4 
(see Section 2) are estimated a large number of times using “observed” data generated by the 
respective models. For each model, the Monte Carlo experiments are conducted for several sets 
of parameter configurations. For Models 1, 2, and 4, the parameters are estimated in each Monte 
Carlo replication using both GII and simulated maximum likelihood (SML) in conjunction with 
the GHK smooth probability simulator (cf. Lee, 1997). Model 3, which cannot easily be estimated 
via SML, is estimated using only GII. We omit Model 5, as Altonji, Smith, and Vidangos (2013)
already present results showing that GII performs well for Heckman selection-type models. 

In all cases, we use the LR approach to (generalized) indirect inference to construct our 
estimates. We do this for two reasons. First, unlike the Wald and LM approaches, the LR 
approach does not require the estimation of a weight matrix. In this respect, the LR approach is 
easier to implement than the other two approaches. Furthermore, because estimates of optimal 
weight matrices often do not perform well in finite samples (see e.g. Altonji and Segal, 1996), 
the LR approach is likely to perform better in small samples. Second, because the LR approach 
is asymptotically equivalent to the other two approaches when the auxiliary model is correctly 
specified, the relative inefficiency of the LR estimator is likely to be small when the auxiliary 
model is chosen judiciously. 

To optimize the criterion functions, we use a version of the Davidon-Fletcher-Powell algorithm
(as implemented in Chapter 10 of Press, Flannery, Teukolsky, and Vetterling, 1993), which is
closely related to the quasi-Newton routine analyzed in Section 5.6. The initial parameter vector
in the hillclimbing algorithm is the true parameter vector. Most of the computation time in
generalized indirect inference lies in computing ordinary least squares (OLS) estimates. The main
cost in computing OLS estimates lies, in turn, in computing the $X^{\top}X$ part of $(X^{\top}X)^{-1}X^{\top}Y$.
We use blocking and loop unrolling techniques to speed up the computation of $X^{\top}X$ by a factor
of 2 to 3 relative to a “naive” algorithm.

Footnote. To avoid redundant calculations, we also precompute and store for later use those elements of $X^{\top}X$ that
depend only on the exogenous variables. We are grateful to James MacKinnon for providing code that implements
the blocking and loop unrolling techniques.

6.1 Results for Model 1

Model 1 is a two-alternative panel probit model with serially correlated errors and one exogenous
regressor. It has two unknown parameters: the regressor coefficient $b$, and the serial correlation
parameter $r$. We set $b = 1$ and consider $r \in \{0, 0.40, 0.85\}$. In the Monte Carlo experiments,
$n = 1000$ and $T = 5$. As in all of the simulation exercises carried out in this paper, we compute
the GII estimator via the two-step approach described in Section 4.3, using $(\lambda, M) = (0.03, 10)$
in the first step, and $(\lambda, M) = (0.003, 300)$ in the second. The exogenous variables (the $x_{it}$'s) are
i.i.d. draws from a $N[0,1]$ distribution, drawn anew for each Monte Carlo replication.
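The simulation step behind these experiments can be sketched as follows. This is a hedged illustration only: the latent-utility equation, the AR(1) error recursion with a zero initial value, and the logistic form of the smoothing function are assumptions made here (Model 1 and the paper's smoothing function are defined in earlier sections that are not restated in this one), and the names `simulate_panel` and `smooth` are hypothetical.

```python
import math
import random


def smooth(u, lam):
    """Logistic smoothing of the indicator 1{u > 0} (assumed form), computed stably."""
    z = u / lam
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def simulate_panel(n, T, b, r, lam, rng):
    """Simulate (x, y, smoothed y) for n individuals over T periods.

    Assumed model: u_it = b * x_it + e_it, e_it = r * e_{i,t-1} + eta_it,
    with eta_it ~ N(0, 1) and e_i0 = 0 (a non-stationary start).
    """
    data = []
    for _ in range(n):
        e_prev = 0.0
        rows = []
        for _ in range(T):
            x = rng.gauss(0.0, 1.0)               # exogenous regressor
            e = r * e_prev + rng.gauss(0.0, 1.0)  # serially correlated error
            u = b * x + e                          # latent utility
            y = 1.0 if u > 0 else 0.0              # discrete choice
            rows.append((x, y, smooth(u, lam)))
            e_prev = e
        data.append(rows)
    return data


rng = random.Random(0)
panel = simulate_panel(n=1000, T=5, b=1.0, r=0.4, lam=0.003, rng=rng)
```

As the smoothing parameter shrinks, the smoothed outcome approaches the discrete choice, while remaining a differentiable function of the structural parameters for any fixed positive value.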

The auxiliary model consists of $T$ linear probability models of the form

$$y_{it} = z_{it}^{\top}\alpha_t + \xi_{it},$$

where $\xi_{it} \sim N[0, \sigma_t^2]$, $z_{it}$ denotes the vector of regressors for individual $i$ in time period $t$, and
$\alpha_t$ and $\sigma_t^2$ are parameters to be estimated. We include in $z_{it}$ both lagged choices and polynomial
functions of current and lagged exogenous variables; the included variables change over time, so as
to allow the auxiliary model to incorporate the additional lagged information that is available in
later time periods. (When estimating the model on simulated data, the simulated lagged choices
are of course replaced by their smoothed counterparts, as per the discussion in Section 4.1 above.)
The auxiliary model is thus characterized by the parameters $\theta = \{\alpha_t, \sigma_t^2\}_{t=1}^{T}$; these are estimated
by maximum likelihood (which corresponds to OLS here, under the distributional assumptions
on $\xi_{it}$).
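A period-by-period OLS fit of such linear probability models can be sketched in pure Python as below. The helper names `ols` and `auxiliary_estimates` are hypothetical, and the sketch solves the normal equations by plain Gaussian elimination; it does not attempt the blocking and loop-unrolling optimizations mentioned in the text.

```python
def ols(X, y):
    """Return OLS coefficients (X'X)^{-1} X'y and the ML residual variance."""
    n, k = len(X), len(X[0])
    # Cross-products X'X and X'y.
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    # Solve the normal equations by Gaussian elimination with partial pivoting.
    M = [xtx[a] + [xty[a]] for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, k):
            factor = M[i][c] / M[c][c]
            for j in range(c, k + 1):
                M[i][j] -= factor * M[c][j]
    coef = [0.0] * k
    for c in reversed(range(k)):
        s = sum(M[c][j] * coef[j] for j in range(c + 1, k))
        coef[c] = (M[c][k] - s) / M[c][c]
    resid = [yi - sum(ci * xi for ci, xi in zip(coef, row)) for row, yi in zip(X, y)]
    sigma2 = sum(e * e for e in resid) / n   # divides by n, as under maximum likelihood
    return coef, sigma2


def auxiliary_estimates(Z, Y):
    """Fit one linear probability model per period.

    Z[t] holds the regressor rows z_it and Y[t] the outcomes y_it for period t.
    """
    return [ols(Z[t], Y[t]) for t in range(len(Z))]


# Tiny noiseless check: y = 0.1 + 0.2 * x in a single period.
Z = [[[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]]
Y = [[0.1, 0.3, 0.5, 0.7]]
(alpha_1, sigma2_1), = auxiliary_estimates(Z, Y)
```

When the model is fitted to simulated data, the lagged choices entering the regressor rows would be the smoothed ones, so the fitted coefficients vary smoothly with the structural parameters.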

It is worth emphasizing that we include lagged choices (and lagged x’s) in the auxiliary model
despite the fact that the structural model does not exhibit true state dependence. But in Model
1 it is well known that lagged choices are predictive of current choices (termed “spurious state
dependence” by Heckman). This is a good illustration of how a good auxiliary model should be
designed to capture the correlation patterns in the data, as opposed to the true structure.

To examine how increasing the “richness” of the auxiliary model affects the efficiency of the
structural parameter estimates, we conduct Monte Carlo experiments using four nested auxiliary
models. In all four, we impose the restrictions $\alpha_t = \alpha_q$ and $\sigma_t^2 = \sigma_q^2$, $t = q+1, \dots, T$, for some
$q < T$. This is because the time variation in the estimated coefficients of the linear probability
models comes mostly from the non-stationarity of the errors in the structural model, and so it is
negligible after the first few time periods (we do not assume that the initial error is drawn from
the stationary distribution implied by the law of motion for the errors).

In auxiliary model #1, $q = 1$ and the regressors in the linear probability model are given by:
$z_{it} = (1, x_{it}, y_{i,t-1})$, $t = 1, \dots, T$, where the unobserved $y_{i0}$ is set equal to 0. We use this very
simple auxiliary model to illustrate how GII can produce very inefficient estimates if one uses a
poor auxiliary model. In auxiliary model #2, $q = 2$ and the regressors are $z_{i1} = (1, x_{i1})$, and

$$z_{it} = (1, x_{it}, y_{i,t-1}, x_{i,t-1}), \quad t \in \{2, \dots, T\},$$

giving a total of 18 parameters. Auxiliary model #3 has $q = 4$, regressors

Zii = {l,Xii,X^i) Zi3 = il,Xi3,yi2,Xi2,yil,Xii) 

Zi2 — (1 ) Xi2 , yn , Xji) Zit — ( 1 ) Xiti yi^t—li Xi^t—li yi,t—2i Xi,t—2^ t £ {4, • • • , T}, 
and 24 parameters. Finally, auxiliary model #4 has the same regressors as #3, except that

ZiA — (1) Tj4, yj3, Xj3, yi2 , ; l/i 1; ) 

— ( 1 ) Xitj yi^t—1: yi,t—2 ) 2) Xi^t— 3 : 2/i,t—4); t £ • • • ; 

so $q = 5$ and there are 35 parameters.

Table 1 presents the results of six sets of Monte Carlo experiments, each with 2000 replications.
The first two sets of experiments report the results for simulated maximum likelihood,
based on GHK, using 25 draws (SML #1) and 50 draws (SML #2). The remaining four sets
of experiments report the results for generalized indirect inference, where GII #i refers to
generalized indirect inference using auxiliary model #i. In each case, we report the average and
the standard deviation of the parameter estimates. We also report the efficiency loss of GII
relative to SML #2 in the columns labelled $\sigma_{GII}/\sigma_{SML}$, where we divide the standard deviations
of the GII estimates by the standard deviations of the estimates for SML #2. Finally, we report
the average time (in seconds) required to compute estimates (we use the Intel Fortran Compiler
Version 7.1 on a 2.2GHz Intel Xeon processor running Red Hat Linux).
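As a small worked illustration of the efficiency measure just described, the sketch below computes the ratio of Monte Carlo standard deviations. The helper name `efficiency_ratio` is hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not say which convention is used; the numerical example takes two standard deviations reported in Table 1 for $b$ when $r = 0.85$.

```python
import statistics


def efficiency_ratio(gii_estimates, sml_estimates):
    """Ratio of Monte Carlo standard deviations, sigma_GII / sigma_SML."""
    return statistics.stdev(gii_estimates) / statistics.stdev(sml_estimates)


# Table 1, b = 1 and r = 0.85: GII #1 reports a standard deviation of 0.0791
# for b, and SML #2 reports 0.0432.
ratio_b = 0.0791 / 0.0432

# Toy replications: the first set is twice as dispersed as the second.
toy_ratio = efficiency_ratio([0.0, 2.0, 4.0], [0.0, 1.0, 2.0])
```

Ratios recomputed from the rounded table entries can differ from the reported ratios in the last digit, since the reported values are presumably based on unrounded standard deviations.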

Table 1 contains several key findings: 

First, both SML and GII generate estimates with very little bias. 

Second, GII is less efficient than SML, but the efficiency losses are small provided that the
auxiliary model is sufficiently rich. For example, auxiliary model #1 leads to large efficiency
losses, particularly for the case of high serial correlation in the errors ($r = 0.85$). For models
with little serial correlation ($r = 0$), however, auxiliary model #2 is sufficiently rich to make
GII almost as efficient as SML. When there is more serial correlation in the errors, auxiliary model
#2 leads to reasonably large efficiency losses (as high as 30% when $r = 0.85$), but auxiliary model
#3, which contains more lagged information in the linear probability models than does auxiliary
model #2, reduces the worst efficiency loss to 13%. Auxiliary model #4 provides almost no
efficiency gains relative to auxiliary model #3.

Third, GII is faster than SML: computing a set of estimates using GII with auxiliary model
#3 takes about 30% less time than computing a set of estimates using SML with 50 draws.

For generalized indirect inference, we also compute (but do not report in Table 1) estimated
asymptotic standard errors, using the estimators described in Theorem 5.2. In all cases, the
averages of the estimated standard errors across the Monte Carlo replications are very close to
(within a few percent of) the actual standard deviations of the estimates, suggesting that the
asymptotic results provide a good approximation to the behavior of the estimates in samples of
the size that we use.

6.2 Results for Model 2 

Model 2 is a panel probit model with serially correlated errors, a single exogenous regressor, and
a lagged dependent variable. It has three unknown parameters: $b_1$, the coefficient on the exogenous
regressor, $b_2$, the coefficient on the lagged dependent variable, and $r$, the serial correlation
parameter. We set $b_1 = 1$, $b_2 = 0.2$, and consider $r \in \{0, 0.4, 0.85\}$; $n = 1000$ and $T = 10$.

Table 2 presents the results of six sets of Monte Carlo experiments, each with 1000 replications;
the labels SML #i and GII #i are to be interpreted exactly as for Table 1. The results are
similar to those for Model 1. Both SML and GII generate estimates with very little bias. SML
is more efficient than GII, but the efficiency loss is small when the auxiliary model is sufficiently
rich (i.e., 17% at most for model #3, 15% at most for model #4). However, auxiliary model
#1 can lead to very large efficiency losses, as can auxiliary model #2 if there is strong serial
correlation.

Again, average asymptotic standard errors are close to the standard deviations obtained
across the simulations (not reported). Finally, GII using auxiliary model #3 is about 25% faster
than SML using 50 draws.

6.3 Results for Model 3 

Model 3 is identical to Model 2, except there is an “initial conditions” problem: the econometrician 
does not observe individuals’ choices in the first s periods. This is an excellent example of the 
type of problem that motivates this paper: SML is extremely difficult to implement, due to the 
problem of integrating over the initial conditions. But II is appealing, as it is still trivial to 
simulate data from the model. However, we need GII to deal with the discrete outcomes. 

To proceed, our Monte Carlo experiments are parametrized exactly as for Model 2, except
that we set T = 15, with choices in the first s = 5 time periods being unobserved (but note that 
exogenous variables are observed in these time periods). 

Auxiliary model #1 is as for Models 1 and 2: $q = 1$ and the regressors are $z_{it} = (1, x_{it}, y_{i,t-1})$,
$t = s+1, \dots, T$, where the unobserved $y_{is}$ is set equal to 0. In auxiliary model #2, $q = 2$ and
the regressors are:

$$z_{i,s+1} = (1, x_{i,s+1}, x_{is}), \qquad z_{it} = (1, x_{it}, y_{i,t-1}, x_{i,t-1}), \quad t \in \{s+2, \dots, T\},$$

for a total of 19 parameters. In auxiliary model #3, $q = 4$ and there are 27 parameters:

Zi,s+1 — ( 1 ) ) 

'^i,s+2 — (A, Xi^g-f-2, yi^s+lj Xi^g-f-i, Xis) 

— (Aj Xi^g-f-3, yi^g-f-2, Xi^g-f-2, yi^g-t-1, Xi^g-f-l) 

— (1) Xit, yi f—i, yi f— 2 : Xi^t— 2 , yi^t—3)i ^ ^ {s T 4, . . . , T} 

Finally, in auxiliary model #4, $q = 5$ and there are 41 parameters: relative to #3, $z_{i,s+1}$, $z_{i,s+2}$
and $z_{i,s+3}$ are augmented by an additional lag of $x_{is}$, and


■2i,s+4 — (1) ®L'5 +i) 

^it — (1) Xit, Xi^t—i, yi^t—2 ) Xi^t—2, yi,t—3i ^i,t—31 4 ); t G {s + 5, . . . , T}. 

Table 3 presents the results of four sets of Monte Carlo experiments, each with 1000 replications.
There are two key findings: First, as with Models 1 and 2, GII generates estimates
with very little bias. Second, increasing the “richness” of the auxiliary model leads to large
efficiency gains relative to auxiliary model #1, particularly when the errors are persistent. However,
auxiliary model #4 provides few efficiency gains relative to auxiliary model #3.

6.4 Results for Model 4

Model 4 is a (static) three-alternative probit model with eight unknown parameters: three
coefficients in each of the two equations for the latent utilities ($\{b_{1i}\}_{i=0}^{2}$ and $\{b_{2i}\}_{i=0}^{2}$) and two
parameters governing the covariance matrix of the disturbances in these equations ($c_1$ and $c_2$).
We set $b_{10} = b_{20} = 0$, $b_{11} = b_{12} = b_{21} = b_{22} = 1$, $c_2 = 1$, and consider $c_1 \in \{0, 1.33\}$ (implying
that the disturbances in the latent utilities are respectively independent, or have a correlation of
0.8). We set $n = 2000$.

The auxiliary model is a pair of linear probability models, one for each of the first two
alternatives:

$$y_{i1} = z_i^{\top}\alpha_1 + \xi_{i1},$$
$$y_{i2} = z_i^{\top}\alpha_2 + \xi_{i2},$$

where $z_i$ consists of polynomial functions of the exogenous variables and
$(\xi_{i1}, \xi_{i2})^{\top} \sim N[0, \Sigma_\xi]$. The auxiliary model parameters $\theta = (\alpha_1, \alpha_2, \Sigma_\xi)$ are estimated by OLS; this
corresponds to maximum likelihood, even though $\Sigma_\xi$ is not diagonal, because the same regressors
appear in both equations.

We conduct Monte Carlo experiments using four nested versions of the auxiliary model. In
auxiliary model #1, $z_i = (1, x_{i1}, x_{i2}, x_{i3})$, giving a total of 11 parameters. Auxiliary model #2
adds all the second-order products of these variables, as well as one third-order product to $z_i$,
i.e.

$$z_i = (1, x_{i1}, x_{i2}, x_{i3}, x_{i1}^2, x_{i2}^2, x_{i3}^2, x_{i1}x_{i2}, x_{i1}x_{i3}, x_{i2}x_{i3}, x_{i1}x_{i2}x_{i3}),$$

for a total of 25 parameters. In auxiliary model #3, $z_i$ contains all third-order products (for a
total of 43 parameters) and in auxiliary model #4, $z_i$ contains all fourth-order products (for a
total of 67 parameters).
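The construction of these polynomial regressor sets can be sketched as follows. The function names are hypothetical, and the parameter count assumes, as the description of the auxiliary model suggests, that each of the two equations has its own coefficient vector and that the error covariance matrix contributes three free elements. Under that assumption the sketch reproduces the counts stated for auxiliary models #1 to #3.

```python
from itertools import combinations_with_replacement


def polynomial_terms(num_vars, max_order):
    """Index tuples for all products of the variables up to max_order.

    The empty tuple stands for the constant; (0, 0, 1) stands for x_1^2 * x_2.
    """
    terms = []
    for order in range(max_order + 1):
        terms.extend(combinations_with_replacement(range(num_vars), order))
    return terms


def evaluate_terms(x, terms):
    """Evaluate each product in `terms` at the point x (a sequence of floats)."""
    values = []
    for term in terms:
        v = 1.0
        for j in term:
            v *= x[j]
        values.append(v)
    return values


def parameter_count(num_regressors, num_equations=2):
    """Coefficients in each equation plus free elements of the error covariance."""
    cov = num_equations * (num_equations + 1) // 2
    return num_equations * num_regressors + cov


count_1 = parameter_count(len(polynomial_terms(3, 1)))       # linear terms only
count_2 = parameter_count(len(polynomial_terms(3, 2)) + 1)   # plus one cubic product
count_3 = parameter_count(len(polynomial_terms(3, 3)))       # all products to order 3
z_example = evaluate_terms([2.0, 3.0, 5.0], [(), (0,), (0, 1), (0, 1, 2)])
```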

Tables 4 and 5 present the results of six sets of Monte Carlo experiments, each with 1000
replications; the labels SML #i and GII #i are to be interpreted exactly as for Table 1. The
key findings are qualitatively similar to those for Models 1, 2, and 3. First, both SML and GII
generate estimates with very little bias. Second, auxiliary model #1, which contains only linear
terms, leads to large efficiency losses relative to SML (as large as 50%). But auxiliary model #2,
which contains terms up to second order, reduces the efficiency losses substantially (to no more
than 15% when the errors are uncorrelated, and to no more than 26% when $c_1 = 1.33$). Auxiliary
model #3, which contains terms up to third order, provides additional small efficiency gains (the
largest efficiency loss is reduced to 20%), while auxiliary model #4, which contains fourth-order
terms, provides few, if any, efficiency gains relative to auxiliary model #3. Finally, computing
estimates using GII with auxiliary model #3 takes about 30% less time than computing estimates
using SML with 50 draws.

7 Conclusion 

Discrete choice models play an important role in many fields of economics, from labor economics
to industrial organization to macroeconomics. Unfortunately, these models are usually quite
challenging to estimate (except in special cases like MNL where choice probabilities have closed
forms). Simulation-based methods like SML and MSM have been developed that can be used
for more complex models like MNP. But in many important cases (models with initial conditions
problems and Heckman selection models being leading cases) even these methods are very difficult
to implement.

In this paper we develop and implement a new simulation-based method for estimating models 
with discrete or mixed discrete/continuous outcomes. The method is based on indirect inference. 
But the traditional II approach is not easily applicable to discrete choice models because one 
must deal with a non-smooth objective surface. The key innovation here is that we develop a 
generalized method of indirect inference (GII), in which the auxiliary models that are estimated 
on the actual and simulated data may differ (provided that the estimates from both models share 
a common probability limit). This allows us to choose an auxiliary model for the simulated data
such that we obtain an objective function that is a smooth function of the structural parameters. 
This smoothness renders GII practical as a method for estimating discrete choice models. 

Our theoretical analysis goes well beyond merely deriving the limiting distribution of the 
minimizer of the GII criterion function. Rather, in keeping with the computational motivation
of this paper, we show that the proposed smoothing facilitates the convergence of derivative- 
based optimizers, in the sense that the smoothing leads to a sample optimization problem that 
is no more difficult than the corresponding population problem, where the latter involves the 
minimization of a necessarily smooth criterion. This provides a rigorous justification for using 
standard derivative-based optimizers to compute the GII estimator, which is also shown to inherit 
the limiting distribution of the (unsmoothed) II estimator. Inferences based on the GII estimates 
may thus be drawn in the standard manner, via the usual Wald statistics. Our results on the 
convergence of derivative-based optimizers seem to be new to the literature. 

We also provide a set of Monte Carlo experiments to illustrate the practical usefulness of GII.
In addition to being robust and fast, GII yields estimates with good properties in small samples.
In particular, the estimates display very little bias and are nearly as efficient as maximum
likelihood (in those cases where simulated versions of maximum likelihood can be used) provided
that the auxiliary model is chosen judiciously.

GII could potentially be applied to a wide range of discrete and discrete/continuous outcome
models beyond those we consider in our Monte Carlo experiments. Indeed, GII is sufficiently
flexible to accommodate almost any conceivable model of discrete choice, including discrete choice
dynamic programming models, discrete dynamic games, etc. We hope that applied economists
from a variety of fields find GII a useful and easy-to-implement method for estimating discrete
choice models.
Table 1

Monte Carlo Results for Model 1

                  Mean             Std. dev.        σ_GII/σ_SML      Time
               b        r         b        r         b       r      (sec.)

b = 1, r = 0
SML #1       1.000   -0.002    0.0387   0.0454      --      --       0.76
SML #2       1.001   -0.000    0.0373   0.0468      --      --       1.53
GII #1       0.998    0.002    0.0390   0.0645     1.05    1.37      0.67
GII #2       0.993    0.001    0.0386   0.0490     1.03    1.05      0.72
GII #3       0.992    0.001    0.0393   0.0490     1.05    1.05      0.91
GII #4       0.988    0.001    0.0390   0.0485     1.05    1.04      0.99

b = 1, r = 0.4
SML #1       0.995    0.385    0.0400   0.0413      --      --       0.78
SML #2       0.999    0.392    0.0390   0.0410      --      --       1.54
GII #1       0.998    0.399    0.0454   0.0616     1.16    1.50      0.70
GII #2       0.993    0.396    0.0410   0.0456     1.05    1.11      0.72
GII #3       0.991    0.395    0.0417   0.0432     1.07    1.05      0.91
GII #4       0.987    0.392    0.0416   0.0432     1.07    1.05      0.97

b = 1, r = 0.85
SML #1       0.984    0.833    0.0452   0.0333      --      --       0.74
SML #2       0.993    0.842    0.0432   0.0316      --      --       1.47
GII #1       0.994    0.846    0.0791   0.0672     1.83    2.13      0.71
GII #2       0.991    0.845    0.0511   0.0412     1.18    1.30      0.74
GII #3       0.992    0.846    0.0492   0.0357     1.14    1.13      0.93
GII #4       0.988    0.841    0.0490   0.0357     1.13    1.13      1.00
Table 2

Monte Carlo Results for Model 2

                     Mean                    Std. dev.               σ_GII/σ_SML        Time
              b1       r       b2        b1        r        b2       b1      r      b2    (sec.)

b1 = 1, r = 0, b2 = 0.2
SML #1      1.000    0.001    0.200    0.0274   0.0357   0.0355      --     --     --     2.47
SML #2      1.002    0.002    0.199    0.0273   0.0362   0.0365      --     --     --     4.89
GII #1      0.999    0.001    0.199    0.0267   0.0571   0.0437     0.98   1.58   1.20    2.72
GII #2      0.996    0.000    0.199    0.0267   0.0379   0.0379     0.98   1.05   1.04    2.80
GII #3      0.995    0.001    0.199    0.0269   0.0377   0.0376     0.99   1.04   1.03    3.66
GII #4      0.993    0.000    0.198    0.0270   0.0377   0.0375     0.99   1.04   1.03    4.06

b1 = 1, r = 0.4, b2 = 0.2
SML #1      0.994    0.379    0.214    0.0278   0.0314   0.0397      --     --     --     2.42
SML #2      0.999    0.389    0.206    0.0287   0.0316   0.0397      --     --     --     4.82
GII #1      0.997    0.397    0.198    0.0339   0.0587   0.0544     1.18   1.86   1.37    2.73
GII #2      0.994    0.396    0.198    0.0293   0.0386   0.0462     1.02   1.22   1.16    2.82
GII #3      0.993    0.396    0.197    0.0289   0.0343   0.0431     1.01   1.09   1.09    3.64
GII #4      0.991    0.395    0.196    0.0289   0.0348   0.0434     1.01   1.10   1.09    4.02

b1 = 1, r = 0.85, b2 = 0.2
SML #1      0.974    0.831    0.220    0.0321   0.0174   0.0505      --     --     --     2.78
SML #2      0.987    0.840    0.208    0.0327   0.0159   0.0507      --     --     --     5.47
GII #1      1.000    0.854    0.183    0.0952   0.0633   0.1185     2.91   3.98   2.34    3.01
GII #2      0.992    0.852    0.190    0.0417   0.0266   0.0721     1.28   1.67   1.42    2.92
GII #3      0.992    0.851    0.191    0.0383   0.0179   0.0547     1.17   1.13   1.08    3.68
GII #4      0.990    0.850    0.188    0.0379   0.0175   0.0548     1.15   1.10   1.09    4.06
Table 3

Monte Carlo Results for Model 3

                     Mean                    Std. dev.            Time
              b1       r       b2        b1        r        b2    (sec.)

b1 = 1, r = 0, b2 = 0.2
GII #1      0.997   -0.000    0.200    0.0272   0.0532   0.0387    3.91
GII #2      0.994   -0.001    0.200    0.0271   0.0387   0.0347    4.01
GII #3      0.993   -0.001    0.199    0.0272   0.0385   0.0345    4.81
GII #4      0.991   -0.001    0.199    0.0275   0.0389   0.0347    5.38

b1 = 1, r = 0.4, b2 = 0.2
GII #1      0.994    0.397    0.198    0.0361   0.0518   0.0493    3.99
GII #2      0.991    0.397    0.197    0.0309   0.0363   0.0430    4.00
GII #3      0.990    0.396    0.196    0.0306   0.0317   0.0399    4.80
GII #4      0.987    0.395    0.196    0.0302   0.0318   0.0400    5.35

b1 = 1, r = 0.85, b2 = 0.2
GII #1      0.993    0.851    0.184    0.0936   0.0403   0.1289    4.41
GII #2      0.986    0.851    0.191    0.0546   0.0249   0.0905    4.37
GII #3      0.987    0.850    0.189    0.0430   0.0140   0.0598    4.93
GII #4      0.984    0.849    0.185    0.0411   0.0136   0.0597    5.56
Table 4

Monte Carlo Results for Model 4

(b10 = 0, b11 = 1, b12 = 1, b20 = 0, b21 = 1, b22 = 1, c1 = 0, c2 = 1)

                 SML                        GII                          σ_GII/σ_SML
              #1       #2        #1       #2       #3       #4       #1     #2     #3     #4

Mean
b10         0.007    0.005     0.003    0.002    0.002    0.002      --     --     --     --
b11         1.000    1.001     0.995    0.994    0.992    0.990      --     --     --     --
b12         1.000    1.003     0.998    0.997    0.995    0.992      --     --     --     --
b20        -0.001   -0.003    -0.006   -0.004   -0.004    0.004      --     --     --     --
b21         1.006    1.007     1.001    0.999    0.997    0.996      --     --     --     --
b22         1.005    1.007     1.004    1.000    0.998    0.996      --     --     --     --
c1          0.020    0.010     0.007    0.005    0.005    0.006      --     --     --     --
c2          1.004    1.003     1.006    1.001    1.001    1.002      --     --     --     --

Std. dev.
b10        0.0630   0.0628    0.0720   0.0666   0.0656   0.0665     1.15   1.06   1.04   1.06
b11        0.0686   0.0686    0.0872   0.0764   0.0741   0.0743     1.27   1.11   1.08   1.08
b12        0.0572   0.0574    0.0719   0.0667   0.0632   0.0646     1.25   1.16   1.10   1.13
b20        0.0663   0.0657    0.0745   0.0686   0.0677   0.0676     1.13   1.04   1.04   1.03
b21        0.1065   0.1050    0.1395   0.1128   0.1095   0.1099     1.33   1.07   1.04   1.05
b22        0.1190   0.1174    0.1593   0.1285   0.1249   0.1244     1.36   1.09   1.06   1.06
c1         0.1091   0.1107    0.1303   0.1276   0.1224   0.1265     1.18   1.15   1.11   1.14
c2         0.1352   0.1325    0.1991   0.1509   0.1439   0.1421     1.50   1.14   1.09   1.07

Time (sec.)  11.5     23.1       7.1     10.4     16.4     34.1      --     --     --     --
Table 5

Monte Carlo Results for Model 4

(b10 = 0, b11 = 1, b12 = 1, b20 = 0, b21 = 1, b22 = 1, c1 = 1.33, c2 = 1)

                 SML                        GII                          σ_GII/σ_SML
              #1       #2        #1       #2       #3       #4       #1     #2     #3     #4

Mean
b10        -0.031   -0.017     0.000   -0.001   -0.000   -0.001      --     --     --     --
b11         0.998    1.000     0.993    0.993    0.991    0.989      --     --     --     --
b12         1.016    1.011     0.998    0.998    0.996    0.994      --     --     --     --
b20        -0.011   -0.010    -0.011   -0.007   -0.007   -0.006      --     --     --     --
b21         0.992    0.999     1.000    0.997    0.995    0.991      --     --     --     --
b22         1.004    1.008     1.006    1.001    0.999    0.995      --     --     --     --
c1          1.269    1.306     1.347    1.338    1.335    1.330      --     --     --     --
c2          1.025    1.011     0.993    0.993    0.995    0.997      --     --     --     --

Std. dev.
b10        0.0693   0.0698    0.0789   0.0776   0.0758   0.0757     1.13   1.11   1.09   1.08
b11        0.0587   0.0588    0.0696   0.0658   0.0632   0.0636     1.18   1.12   1.07   1.08
b12        0.0745   0.0737    0.0883   0.0801   0.0781   0.0782     1.20   1.09   1.06   1.06
b20        0.0766   0.0764    0.0900   0.0801   0.0786   0.0780     1.18   1.05   1.03   1.02
b21        0.0884   0.0886    0.1140   0.0969   0.0952   0.0943     1.29   1.09   1.07   1.06
b22        0.1106   0.1103    0.1471   0.1204   0.1176   0.1153     1.34   1.09   1.07   1.05
c1         0.1641   0.1707    0.2454   0.2152   0.2049   0.2041     1.44   1.26   1.20   1.20
c2         0.1229   0.1206    0.1599   0.1387   0.1338   0.1311     1.33   1.15   1.11   1.09

Time (sec.)  12.7     25.6       7.4     10.8     17.1     34.4      --     --     --     --
A Details of optimization routines

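Both line-search methods (Gauss-Newton and quasi-Newton) involve the use of a positive definite
Hessian $A_{(s)}$ in the approximating model (5.14), and so the problem solved at step $s + 1$ reduces
to that of “approximately” solving

$$\min_{\alpha} Q(\beta^{(s)} + \alpha p_{(s)}), \qquad (A.1)$$

where $p_{(s)} := -A_{(s)}^{-1}\nabla_{(s)}$. We do not require that $\alpha_{(s)}$ solve (A.1) exactly; we shall require only
that it satisfy the strong Wolfe conditions,

$$Q(\beta^{(s)} + \alpha_{(s)} p_{(s)}) \le Q(\beta^{(s)}) + c_1 \alpha_{(s)} \nabla_{(s)}^{\top} p_{(s)},$$
$$|\dot{Q}(\beta^{(s)} + \alpha_{(s)} p_{(s)})^{\top} p_{(s)}| \le c_2 |\nabla_{(s)}^{\top} p_{(s)}|,$$

for $0 < c_1 < c_2 < 1$, where $\dot{Q} := \partial_\beta Q$ (cf. (3.7) in Nocedal and Wright, 2006). For some such
$\alpha_{(s)}$, we set $\beta^{(s+1)} := \beta^{(s)} + \alpha_{(s)} p_{(s)}$. For the Hessians $A_{(s)}$, the Gauss-Newton method is only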

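applicable to criteria of the form $Q(\beta) = \|g(\beta)\|_W^2$, and uses

$$A_{(s)} := G_{(s)}^{\top} W G_{(s)}, \qquad p_{(s)} = -(G_{(s)}^{\top} W G_{(s)})^{-1} G_{(s)}^{\top} W g(\beta^{(s)}),$$

where $G_{(s)} := [\partial_\beta g(\beta^{(s)})]^{\top}$. The quasi-Newton method with BFGS updating starts with some
initial positive definite $A_{(0)}$, and updates it according to

$$A_{(s+1)} := A_{(s)} - \frac{A_{(s)} x_{(s)} x_{(s)}^{\top} A_{(s)}}{x_{(s)}^{\top} A_{(s)} x_{(s)}} + \frac{y_{(s)} y_{(s)}^{\top}}{y_{(s)}^{\top} x_{(s)}},$$

where $x_{(s)} := \alpha_{(s)} p_{(s)}$ and $y_{(s)} := \nabla_{(s+1)} - \nabla_{(s)}$ (cf. (6.19) in Nocedal and Wright, 2006).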
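The two ingredients of the quasi-Newton routine, the strong Wolfe test on the step length and the BFGS update of the Hessian approximation, can be sketched in pure Python as follows. The function names, the default constants `c1 = 1e-4` and `c2 = 0.9`, and the two-parameter test criterion are hypothetical choices for illustration; the paper does not report which constants it uses.

```python
def dot(a, b):
    """Inner product of two vectors stored as lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in A]


def strong_wolfe(Q, grad, beta, p, alpha, c1=1e-4, c2=0.9):
    """Check the strong Wolfe conditions for step length alpha along direction p."""
    new = [bi + alpha * pi for bi, pi in zip(beta, p)]
    slope0 = dot(grad(beta), p)                       # directional derivative at beta
    decrease = Q(new) <= Q(beta) + c1 * alpha * slope0
    curvature = abs(dot(grad(new), p)) <= c2 * abs(slope0)
    return decrease and curvature


def bfgs_update(A, x, y):
    """BFGS update of a symmetric Hessian approximation A.

    x is the step taken and y is the resulting change in the gradient.
    """
    Ax = matvec(A, x)
    xAx = dot(x, Ax)
    yx = dot(y, x)
    k = len(x)
    return [[A[i][j] - Ax[i] * Ax[j] / xAx + y[i] * y[j] / yx
             for j in range(k)] for i in range(k)]


# Hypothetical criterion Q(beta) = beta_1^2 + 2 * beta_2^2.
Q = lambda b: b[0] ** 2 + 2.0 * b[1] ** 2
grad = lambda b: [2.0 * b[0], 4.0 * b[1]]

beta = [1.0, 1.0]
A0 = [[1.0, 0.0], [0.0, 1.0]]
p = [-g for g in grad(beta)]             # direction -A0^{-1} * gradient, with A0 = I
alpha = 0.25
ok = strong_wolfe(Q, grad, beta, p, alpha)
too_long = strong_wolfe(Q, grad, beta, p, 1.0)

x = [alpha * pi for pi in p]
beta_new = [bi + xi for bi, xi in zip(beta, x)]
y = [gn - go for gn, go in zip(grad(beta_new), grad(beta))]
A1 = bfgs_update(A0, x, y)
```

The updated matrix satisfies the secant equation, mapping the step taken to the observed change in the gradient, which is how curvature information accumulates without second derivatives being computed.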

The trust region method considered here sets $\Delta_{(s)} = \partial_\beta^2 Q(\beta^{(s)})$, which need not be positive definite. The procedure then attempts to approximately minimize (5.14), subject to the constraint that $\|\beta\| \le \delta_{(s)}$, where $\delta_{(s)}$ defines the size of the trust region, which is adjusted at each iteration depending on the value of

$$\frac{Q(\beta^{(s)}) - Q(\beta^{(s+1)})}{f_{(s)}(0) - f_{(s)}(\beta^{(s+1)})},$$

which measures the proximity of the true reduction in $Q$ at step $s$ to that predicted by the approximating model (5.14); the adjustment is made in accordance with Algorithm 4.2 in Moré and Sorensen (1983). Various algorithms are available for approximately solving (5.14) in this case, but we shall assume that Algorithm 3.14 from that paper is used.
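As a rough illustration of the ratio that drives the trust-region adjustment, the sketch below computes the actual reduction in the criterion divided by the reduction predicted by a quadratic model, and applies a generic radius rule. The quadratic model is written as a function of the step, which is an assumption about the form of (5.14). The thresholds in `adjust_radius` are textbook-style placeholders and are not Algorithm 4.2 of Moré and Sorensen (1983).

```python
import numpy as np

def reduction_ratio(Q, beta, step, grad, Delta):
    """Actual reduction in Q divided by the reduction predicted by the quadratic model."""
    # Assumed model: f(b) = Q(beta) + grad' b + 0.5 * b' Delta b, as a function of the step b.
    model = lambda b: Q(beta) + grad @ b + 0.5 * b @ Delta @ b
    actual = Q(beta) - Q(beta + step)
    predicted = model(np.zeros_like(step)) - model(step)
    return actual / predicted

def adjust_radius(radius, ratio, step_norm, shrink=0.25, grow=2.0, low=0.25, high=0.75):
    """Generic radius rule with assumed thresholds (hypothetical, for illustration only)."""
    if ratio < low:
        return shrink * radius                     # poor agreement: shrink the region
    if ratio > high and np.isclose(step_norm, radius):
        return grow * radius                       # good agreement at the boundary: expand
    return radius

# Hypothetical quadratic criterion, for which the model is exact and the ratio equals 1.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
Q = lambda b: 0.5 * b @ A @ b
beta = np.array([1.0, -1.0])
grad = A @ beta
step = -0.5 * grad
ratio = reduction_ratio(Q, beta, step, grad, A)
```

A ratio close to one indicates that the quadratic model predicts the criterion well over the current region; values near zero or negative signal that the region should shrink.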


B Proofs of theorems under high-level assumptions 

Assumptions R and H are assumed to hold throughout this section, including H5 with $l = 0$.
Whenever we require H5 to hold for some $l \in \{1,2\}$, this will be explicitly noted. The relationships
between the theorems and the auxiliary results (Propositions B.1-B.5) are illustrated in Figure B.1.
[Diagram not recoverable; nodes include Prop. B.1, Prop. B.2, Prop. B.3 and Prop. B.5.]

Figure B.1: Proofs of theorems


B.1 Preliminary results

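Let $\beta_n := \beta_0 + n^{-1/2}\delta_n$ for a (possibly) random $\delta_n = o_p(n^{1/2})$. Define

$$\Delta_n^k(\beta) := n^{1/2}[\bar\theta_n^k(\beta,\lambda_n) - \bar\theta_n^k(\beta_0,\lambda_n)]$$

and recall that $G_n(\beta) := [\partial_\beta \bar\theta_n^k(\beta,\lambda_n)^{\mathsf T}]^{\mathsf T}$ and $G := [\partial_\beta \theta(\beta_0,0)^{\mathsf T}]^{\mathsf T}$. As in R5, $\lambda_n = o_p(1)$ is an $\mathcal{F}$-measurable sequence. As per R6, we fix the order of jackknifing $k \in \{0,\ldots,k_0\}$ such that $n^{1/2}\lambda_n^{k+1} = o_p(1)$. Let $\mathcal{L}_n(\theta) := \mathcal{L}_n(y,x;\theta)$ and $\mathcal{L}(\theta) := \mathrm{E}\mathcal{L}_n(\theta)$. $\dot{\mathcal{L}}_n$ and $\ddot{\mathcal{L}}_n$ respectively denote the gradient and Hessian of $\mathcal{L}_n$, with $H := \mathrm{E}\ddot{\mathcal{L}}_n(\theta_0) = \ddot{\mathcal{L}}(\theta_0)$; $N(\theta,\epsilon)$ denotes an open ball of radius $\epsilon$, centered at $\theta$.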

Proposition B.l. 

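(i) $\sup_{\beta\in B}\|\bar\theta_n^k(\beta,\lambda_n) - \theta^k(\beta,\lambda_n)\| \overset{p}{\to} 0$;

(ii) $\theta^k(\beta_0,\lambda_n) - \theta(\beta_0,0) = O_p(\lambda_n^{k+1})$;

(iii) $\Delta_n^k(\beta_n) = G\delta_n + o_p(1 + \|\delta_n\|)$.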

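Proposition B.2. For $V = (1 + \frac{1}{M})(\Sigma - R)$,

$$Z_n := n^{1/2}[\bar\theta_n^k(\beta_0,\lambda_n) - \theta^k(\beta_0,\lambda_n)] - n^{1/2}(\hat\theta_n - \theta_0) \rightsquigarrow N[0, H^{-1}VH^{-1}]. \tag{B.1}$$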

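Proposition B.3.

(i) $Q_{nk}^e(\beta,\lambda_n) \overset{p}{\to} Q_k^e(\beta,0) =: Q^e(\beta)$ uniformly on $B$; and

(ii) for every $\epsilon > 0$, $\inf_{\beta\in B\setminus N(\beta_0,\epsilon)} Q^e(\beta) > Q^e(\beta_0)$.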

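Proposition B.4. If H5 holds for $l = 1$, then

(i) $G_n(\beta_n) \overset{p}{\to} G$; and

if H5 holds for $l \in \{1,2\}$ then, uniformly on $B$,

(ii) $\sup_{\beta\in B}\|\partial_\beta^l\bar\theta_n^k(\beta,\lambda_n) - \partial_\beta^l\theta(\beta,0)\| = o_p(1)$; and

(iii) $\partial_\beta^l Q_{nk}^e(\beta,\lambda_n) \overset{p}{\to} \partial_\beta^l Q_k^e(\beta,0) = \partial_\beta^l Q^e(\beta)$.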

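For the next result, let $U : \Gamma \to \mathbb{R}$ be twice continuously differentiable with a global minimum at $\gamma^*$. Let $R_U := \{\gamma\in\Gamma \mid \|\partial_\gamma U(\gamma)\| < \epsilon\}$ for some $\epsilon > 0$, and $S_U := \{\gamma\in R_U \mid \varrho_{\min}[\partial_\gamma^2 U(\gamma)] > 0\}$. Applying a routine $r \in \{\mathrm{GN}, \mathrm{QN}, \mathrm{TR}\}$ to $U$ yields the iterates $\{\gamma^{(s)}\}$; let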

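$$\bar\gamma(\gamma^{(0)}, r) := \begin{cases} \gamma^{(s^*)} & \text{if } \gamma^{(s)} \in R_U \text{ for some } s \in \mathbb{N} \\ \gamma^{(0)} & \text{otherwise,} \end{cases}$$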

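where $s^*$ denotes the smallest $s$ for which $\gamma^{(s)} \in R_U$. When $r = \mathrm{TR}$, the definition of $\bar\gamma(\gamma^{(0)}, \mathrm{TR})$ is analogous, but with $S_U$ in place of $R_U$. In the statement of the next result, $\Gamma_0 := \{\gamma\in\Gamma \mid U(\gamma) \le U(\gamma_1)\}$ for some $\gamma_1 \in \Gamma$, and is a compact set with $\gamma^* \in \operatorname{int}\Gamma_0$. For a function $m$ on $\Gamma$, let $M(\gamma) := [\partial_\gamma m(\gamma)^{\mathsf T}]^{\mathsf T}$ denote its Jacobian.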
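The distinction between near-roots of the gradient and near-roots that also satisfy the second-order condition can be made concrete with a small sketch. The test function, the tolerance `eps`, and the helper names below are hypothetical; the function is chosen only because it has a saddle point and a local minimum that are easy to verify by hand.

```python
import numpy as np

def in_R(grad, eps):
    """Membership in the set of near-roots: gradient norm below eps."""
    return bool(np.linalg.norm(grad) < eps)

def in_S(grad, hess, eps):
    """Near-root that also has a positive definite Hessian (smallest eigenvalue > 0)."""
    return in_R(grad, eps) and bool(np.linalg.eigvalsh(hess).min() > 0)

# Hypothetical criterion U(g) = g1^2 - g2^2 + g2^4 / 2:
# saddle point at (0, 0), local minima at (0, 1) and (0, -1).
def grad_U(g):
    return np.array([2.0 * g[0], -2.0 * g[1] + 2.0 * g[1] ** 3])

def hess_U(g):
    return np.array([[2.0, 0.0], [0.0, -2.0 + 6.0 * g[1] ** 2]])

eps = 1e-6
saddle = np.array([0.0, 0.0])
minimum = np.array([0.0, 1.0])

saddle_in_R = in_R(grad_U(saddle), eps)
saddle_in_S = in_S(grad_U(saddle), hess_U(saddle), eps)
minimum_in_S = in_S(grad_U(minimum), hess_U(minimum), eps)
```

A routine that stops on the gradient alone could terminate at the saddle; adding the eigenvalue check excludes it, which is the role played by the second set in the trust-region case.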

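Proposition B.5. Let $r \in \{\mathrm{QN}, \mathrm{TR}\}$, and suppose that in addition to the preceding, either

(i) $r = \mathrm{GN}$ and $U(\gamma) = \|m(\gamma)\|^2$, with $\inf_{\gamma\in\Gamma_0}\sigma_{\min}[M(\gamma)] > 0$; or

(ii) $r = \mathrm{QN}$ and $U$ is strictly convex on $\Gamma_0$;

then $\bar\gamma(\gamma^{(0)}, r) \in R_U \cap \Gamma_0$ for all $\gamma^{(0)} \in \Gamma_0$. Alternatively, if $r = \mathrm{TR}$, then $\bar\gamma(\gamma^{(0)}, r) \in S_U \cap \Gamma_0$ for all $\gamma^{(0)} \in \Gamma_0$.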

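B.2 Proofs of Theorems 5.1-5.5

Throughout this section, $\beta_n := \beta_0 + n^{-1/2}\delta_n$ for a (possibly) random $\delta_n = o_p(n^{1/2})$. Let $Q_n^W(\beta) := Q_{nk}^W(\beta,\lambda_n)$, $Q_n^{LR}(\beta) := Q_{nk}^{LR}(\beta,\lambda_n)$, and $\bar\theta_n^k(\beta) := \bar\theta_n^k(\beta,\lambda_n)$.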

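Proof of Theorem 5.1. We first consider the Wald estimator. We have

$$n[Q_n^W(\beta_n) - Q_n^W(\beta_0)] = 2n^{1/2}[\bar\theta_n^k(\beta_0) - \hat\theta_n]^{\mathsf T} W_n \Delta_n^k(\beta_n) + \Delta_n^k(\beta_n)^{\mathsf T} W_n \Delta_n^k(\beta_n).$$

For $Z_n$ as defined in (B.1), we see that by Proposition B.1(ii) and R6

$$n^{1/2}[\bar\theta_n^k(\beta_0) - \hat\theta_n] = Z_n + n^{1/2}[\theta^k(\beta_0,\lambda_n) - \theta_0] = Z_n + o_p(1), \tag{B.2}$$

whence by Proposition B.1(iii),

$$n[Q_n^W(\beta_n) - Q_n^W(\beta_0)] = 2Z_n^{\mathsf T} W G\delta_n + \delta_n^{\mathsf T} G^{\mathsf T} W G\delta_n + o_p(1 + \|\delta_n\| + \|\delta_n\|^2). \tag{B.3}$$

Now consider the LR estimator. Twice continuous differentiability of the likelihood yields

$$n[Q_n^{LR}(\beta_n) - Q_n^{LR}(\beta_0)] = -n[\mathcal{L}_n(\bar\theta_n^k(\beta_n)) - \mathcal{L}_n(\bar\theta_n^k(\beta_0))]$$
$$= -n^{1/2}\dot{\mathcal{L}}_n(\bar\theta_n^k(\beta_0))^{\mathsf T}\Delta_n^k(\beta_n) - \tfrac{1}{2}\Delta_n^k(\beta_n)^{\mathsf T}\ddot{\mathcal{L}}_n(\bar\theta_n^k(\beta_0))\Delta_n^k(\beta_n) + o_p(\|\Delta_n^k(\beta_n)\|^2)$$

where by Proposition B.1(ii) and H4,

$$n^{1/2}\dot{\mathcal{L}}_n(\bar\theta_n^k(\beta_0)) = n^{1/2}\dot{\mathcal{L}}_n(\theta_0) + \ddot{\mathcal{L}}_n(\theta_0)\,n^{1/2}[\bar\theta_n^k(\beta_0) - \theta_0] + o_p(1)$$
$$= H[Z_n + n^{1/2}(\theta^k(\beta_0,\lambda_n) - \theta_0)] + o_p(1)$$
$$= HZ_n + o_p(1) \tag{B.4}$$

for $Z_n$ as in (B.1). Thus by Proposition B.1(iii),

$$n[Q_n^{LR}(\beta_n) - Q_n^{LR}(\beta_0)] = -Z_n^{\mathsf T} HG\delta_n - \tfrac{1}{2}\delta_n^{\mathsf T} G^{\mathsf T} HG\delta_n + o_p(1 + \|\delta_n\| + \|\delta_n\|^2). \tag{B.5}$$

Consistency of $\hat\beta_{nk}^e$ follows from parts (i) and (ii) of Proposition B.3 and Corollary 3.2.3 in van der Vaart and Wellner (1996). Thus by applying Theorem 3.2.16 in van der Vaart and Wellner (1996) - or more precisely, the arguments following their (3.2.17) - to (B.3) and (B.5), we have

$$n^{1/2}(\hat\beta_{nk}^e - \beta_0) = -(G^{\mathsf T} U_e G)^{-1} G^{\mathsf T} U_e Z_n + o_p(1) \tag{B.6}$$

for $U_e$ as in (5.11); the result now follows by Proposition B.2. □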
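The passage from the quadratic expansion (B.3) to the linear representation (B.6) rests on minimizing a quadratic form in the local parameter. A minimal numerical sketch for the Wald case follows, with randomly drawn stand-ins for the Jacobian, the weighting matrix and the limiting vector (all hypothetical). It checks that the closed-form minimizer sets the gradient of the quadratic to zero and is not beaten by nearby points.

```python
import numpy as np

rng = np.random.default_rng(0)
d_theta, d_beta = 5, 3

G = rng.normal(size=(d_theta, d_beta))            # stand-in for the binding-function Jacobian
B = rng.normal(size=(d_theta, d_theta))
W = B @ B.T + np.eye(d_theta)                     # positive definite weighting matrix
Z = rng.normal(size=d_theta)                      # stand-in for the limiting vector Z_n

def quad(delta):
    """Leading terms of the expansion: 2 Z' W G delta + delta' G' W G delta."""
    return 2.0 * Z @ W @ G @ delta + delta @ G.T @ W @ G @ delta

# Closed-form minimizer: delta* = -(G' W G)^{-1} G' W Z.
delta_star = -np.linalg.solve(G.T @ W @ G, G.T @ W @ Z)

# First-order condition evaluated at delta*.
grad_at_star = 2.0 * G.T @ W @ Z + 2.0 * G.T @ W @ G @ delta_star

# Objective at randomly perturbed points, for comparison with the minimum.
perturbed = [quad(delta_star + 0.1 * rng.normal(size=d_beta)) for _ in range(100)]
```

The same algebra applies to the LR expansion with the weighting matrix replaced by the corresponding Hessian term, since the sign cancels in the first-order condition.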

Proof of Theorem 5.2. We hrst note that, in consequence of H 3 and Theorem 5.1, —)• /?o, 

0n 9o, and 0™ ;= d'ff \n) ^ ^o- Part (i) then follows from R 2 , H 2 , and Lemma 2.4 in 
Newey and McFadden (1994). Dehning £™(0o) := ^™(/3o) 0; ^ 0 ) for Rr £ {1; • • • > XI] and 

iKf3o,o;eoV ••• if{fio,o-,eoV], 

H 2 and H4 further imply that 

'S R ••• R' 

X ( — SniSjii \ A ^ A (EcjCj )A = A _ _ ^ A = V. 

R R ••• S_ 

Part (iii) is an immediate consequence of Proposition B.4(i). □

Proof of Theorem 5.3. We first prove part (i). Let QniP) ■= 9 / 3 ( 5 ^(/ 3 ) and := 9^(5®(/3,0). 

Since /3o £ intB and Q^iP) Q{P) uniformly on B, the global minimum of Q® is interior to 
B, w.p.a.l., whence is non-empty w.p.a.l. Letting {/3n} denote a (random) sequence with 
Pn G R^f. for all n sufficiently large, we have by Proposition B.4(iii) that 

Q^i^n) = QnWn) + Op(l) = Op(l Cn) = Op{l). (B.7) 

Since (j® is continuous and B compact, it follows that d(/3n, R^) A 0, whence ^^(7?®^, R^) A- 0. 

We now turn to part (ii). Recall /3n = /3o + for some Sn = Op(n^/^). For the Wald 

criterion, taking such that /?„ G gives 

Op{l) = = 2[n^/'(^(/3n) - 0n)]^kFG„(/3„) 
where, for $Z_n$ as in (B.1),

- k) = - k) + A^(/3„) = + G5n + Op(l + ||h„||) 

by (B.2), R6, and parts (ii) and (iii) of Proposition B.1. Hence, using Proposition B.4(i),


$$o_p(1) = 2[\delta_n^{\mathsf T} G^{\mathsf T} W G + Z_n^{\mathsf T} W G] + o_p(1 + \|\delta_n\|). \tag{B.8}$$

Similarly, for the LR criterion, taking G in this case gives 

Op{l) = = k^^Cnitnif^kVOniPn) 

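where by the twice continuous differentiability of the likelihood, Proposition B.1(iii) and (B.4),

$$n^{1/2}\dot{\mathcal{L}}_n[\bar\theta_n^k(\beta_n)] = n^{1/2}\dot{\mathcal{L}}_n[\bar\theta_n^k(\beta_0)] + \ddot{\mathcal{L}}_n[\bar\theta_n^k(\beta_0)]\Delta_n^k(\beta_n) + o_p(\|\Delta_n^k(\beta_n)\|)$$
$$= HZ_n + HG\delta_n + o_p(1 + \|\delta_n\|).$$

Thus by Proposition B.4(i),

$$o_p(1) = \delta_n^{\mathsf T} G^{\mathsf T} HG + Z_n^{\mathsf T} HG + o_p(1 + \|\delta_n\|). \tag{B.9}$$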

By specializing (B.8) and (B.9) to the case where 5n = kk satisfying the 

requirements of part (ii) of the theorem, we see that for Ue as in (5.11), 

n^^kkk - f3o) = -{G'^UeGr^G'^UeZn + Op(l) = - (3o) + 0^(1) 

for e G {W, LR}, in consequence of (B.6). 

Finally, we turn to part (hi). Let /?n denote the minimizer of Qn{P), which lies in Rnk 
w.p.a.l., by part (i), and /?„ another (random) sequence satisfying the requirements of part (iii). 
By Proposition B.3 and the consistency of /3n (Theorem 5.1), 

Q^(/3o) + Op(l) = QUk) + Op{l) > Ql{k) > Ql0n) = + Op(l). 

Thus Q^0n) = Qn0n) + Op{l) A Q®(/3o), also by Proposition B.3; whence k ^ l^o, since 
has a well-separated minimum at /?o- □ 

Proof of Theorem 5.4- Let Q^iP) ■= d‘^Q^{l3), Q^{I3) '■= {/3,0), and {/3n} be a (random) 

sequence with /3„ G for all n sufficiently large. Then by Proposition B.4(iii), 

^{Qmi-a[Q^ iPn)] < ~e} = ^{Qmiia[Qn{i3n)] + Op{l) < —cj < P{Op(l) < —cj —)• 0 

for any e > 0. Hence by Theorem 5.3(i), the continuity of and ^ 0- Part (ii) 

follows immediately from the corresponding part of Theorem 5.3 and the fact that 5®^ C □ 

Proof of Theorem 5.5. For each r G {GN, QN, TR}, suppose that there exists a Bq C B such that 
^ = QniP) •= Qnkif^^^n) satisfies the corresponding part of Proposition B.5, w.p.a.l. Then 

e Kk n Bo, V/3(0) G Bo} 4 0 

for r G {GN, QN}, and also for r = TR with 5®^ in place of we may take Cn = Op(n“^/^) 
in the definition Further, R® n Bo = {/3o} under GN and QN, w hil e S® n Bo = {/3o} under 

TR. Thus, when r G {GN, QN} we have w.p.a.l, 

sup d0lkiP^°\r),l3o) < dL(R®fcnBo,{/3o}) = Op(n“^/^) 

/3(o)eBo 

with the final estimate following by Theorem 5.3. When r = TR, the preceding holds with 
in place of R^^, in this case via Theorem 5.4. 

It thus remains to verify that the requirements of Proposition B.5 hold w.p.a.l. When 
r = GN, it follows from Proposition B.4(i), the continuity of o'jnin(') and GN that 

0 < inf fTmin[G(/3)] = inf CJmin[G„(/3)] +Op(l), 
pGBo pGBo 

whence inf^gg^ o'min[G'n(/?)] > 0 w.p.a.l. When r = QN, it follows from Proposition B.4(iii) and 
QN that 

0 < inf = inf ^min[5|Pn(/5)] + Op(l) 

P^Bq P^Bq 

whence $Q_n$ is strictly convex on $B_0$ w.p.a.1. When $r = \mathrm{TR}$, there are no additional conditions
to verify. □


B.3 Proofs of Propositions B.1-B.5

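Proof of Proposition B.1. Part (i) follows by H3 and the continuous mapping theorem. Part (ii) is immediate from (4.3). For part (iii), we note that for $\beta_n = \beta_0 + n^{-1/2}\delta_n$ with $\delta_n = o_p(n^{1/2})$ as above,

$$\Delta_n^k(\beta_n) = n^{1/2}[\bar\theta_n^k(\beta_n,\lambda_n) - \theta^k(\beta_n,\lambda_n)] - n^{1/2}[\bar\theta_n^k(\beta_0,\lambda_n) - \theta^k(\beta_0,\lambda_n)] + n^{1/2}[\theta^k(\beta_n,\lambda_n) - \theta^k(\beta_0,\lambda_n)].$$

Since $\bar\theta_n^k$ is a linear combination of the $\hat\theta_n^m$'s, it is clear from H3 that the first two terms converge jointly in distribution to identical limits (since $\beta_n \overset{p}{\to} \beta_0$). For the final term, continuous differentiability of $\theta^k$ (R3 above) entails that

$$n^{1/2}[\theta^k(\beta_n,\lambda_n) - \theta^k(\beta_0,\lambda_n)] = [\partial_\beta\theta^k(\beta_0,\lambda_n)^{\mathsf T}]^{\mathsf T} n^{1/2}(\beta_n - \beta_0) + o_p(n^{1/2}\|\beta_n - \beta_0\|)$$
$$= G\delta_n + o_p(1 + \|\delta_n\|).$$

□

Proof of Proposition B.2. Note first that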
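$$n^{1/2}[\bar\theta_n^k(\beta_0,\lambda_n) - \theta^k(\beta_0,\lambda_n)] = \sum_{r=0}^{k}\gamma_{rk}\,\frac{1}{M}\sum_{m=1}^{M} n^{1/2}[\hat\theta_n^m(\beta_0,\delta^r\lambda_n) - \theta(\beta_0,\delta^r\lambda_n)]$$
$$= \frac{1}{M}\sum_{m=1}^{M}\sum_{r=0}^{k}\gamma_{rk}\,\psi_n^m(\beta_0,\delta^r\lambda_n) \rightsquigarrow \frac{1}{M}\sum_{m=1}^{M}\psi^m(\beta_0,0),$$

by (4.3), (4.4), H3 and $\sum_{r=0}^{k}\gamma_{rk} = 1$. By H3, this holds jointly with

$$n^{1/2}(\hat\theta_n - \theta_0) \rightsquigarrow \psi^0(\beta_0,0).$$

Since H4 implies that $\psi^m(\beta_0,0) = -H^{-1}\phi^m$, the limiting variance of $Z_n$ is equal to

$$\operatorname{var}\Big[\psi^0(\beta_0,0) - \frac{1}{M}\sum_{m=1}^{M}\psi^m(\beta_0,0)\Big] = H^{-1}\operatorname{var}\Big[\phi^0 - \frac{1}{M}\sum_{m=1}^{M}\phi^m\Big]H^{-1} = H^{-1}VH^{-1},$$

where the final equality follows from H4 and straightforward calculations.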


□ 


Proof of Proposition B.3. We first prove part (i). For the Wald estimator, this is immediate from Proposition B.1(i). For the LR estimator, it follows from Proposition B.1(i), H2 and the continuous mapping theorem (arguing as on pp. 144f. of Billingsley, 1968), that

$$Q_{nk}^{LR}(\beta,\lambda_n) = -(\mathcal{L}_n\circ\bar\theta_n^k)(\beta,\lambda_n) \overset{p}{\to} -(\mathcal{L}\circ\theta^k)(\beta,0) = Q_k^{LR}(\beta,0)$$

uniformly on $B$.

For part (ii), we note that $\beta \mapsto \theta^k(\beta,0)$ is continuous by R3, while the continuity of $\mathcal{L}$ is implied by H2, since $\mathcal{L}_n$ is continuous. Thus $Q^e$ is continuous for $e \in \{W, LR\}$, and by R4 is uniquely minimized at $\beta_0$. Hence $\beta \mapsto Q^e(\beta)$ has a well-separated minimum, which by R1 is interior to $B$. □

Proof of Proposition B.4- Part (ii) is immediate from H5, (4.4) and the continuous mapping theo- 

—k T 

rem; it further implies part (i). For part (iii), recall Qn{(3) = dpQ^{j3), and Gn{(3) = [d/sOniP)] ■ 
Then we have 

Q^{(3) = Gnif3)'^Wn[en{l3) - k] Q]i^i(3) = Gn{f3)'^Cnitm- 

Part (i), and arguments similar to those used in the proof of part (i) of Proposition B.3, yield that $\dot Q_n^e(\beta) \overset{p}{\to} \partial_\beta Q_k^e(\beta,0) =: \dot Q^e(\beta)$ uniformly on $B$. The proof that the second derivatives converge uniformly is analogous. □

Proof of Proposition B.5. For r = GN, the result follows by Theorem 10.1 in Nocedal and Wright
(2006); for r = QN, by their Theorem 6.5; and for r = TR, by Theorem 4.13 in Moré and Sorensen
(1983). □
[Diagram not recoverable; nodes include Lem. D.1.]

Figure C.1: Proof of Proposition 5.1


C Sufficiency of the low-level assumptions 

We shall henceforth maintain both Assumptions L and R, and address the question of whether
these are sufficient for Assumption H; that is, we shall prove Proposition 5.1. The main steps
leading to the proof are displayed in Figure C.1.

Recall that, as per L8, the auxiliary model is the Gaussian SUR displayed in (5.1) above. For 
simplicity, we shall consider only the case where is unrestricted, but our arguments extend 
straightforwardly to the case where is block diagonal (as would typically be imposed when 
r > 1). Recall that 9 collects the elements of a and Fix an m G {0,1,... M}, and define 

/?, A) CXy^Tlyry{Zi, [3, A), 

temporarily suppressing the dependence of y (and hence ^ri) on m. Collecting := (^ij,..., Cdyi )'^7 
the average log-likelihood of the auxiliary model can be written as 

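$$\mathcal{L}_n(y,x;\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell(y_i,x_i;\theta) = -\frac{d_y}{2}\log 2\pi - \frac{1}{2}\log\det\Sigma_\xi - \frac{1}{2}\operatorname{tr}\Big[\Sigma_\xi^{-1}\frac{1}{n}\sum_{i=1}^{n}\xi_i(\alpha)\xi_i(\alpha)^{\mathsf T}\Big].$$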

Deduce that there are functions L and I, which are three times continuously differentiable in 
both arguments (at least on int 0), such that 

Cn{y,x;9) = L{Tn;9) £{yi, xf, 9) = [{U; 6) (C.l) 



(C.2) 

(C.3) 

(C.4) 
Since the elements of the score vector $\dot\ell_i(\theta) = \partial_\theta\ell_i(\theta)$ necessarily take one of the forms displayed
in (C.3) or (C.4), we may conclude that, for any compact subset $A \subset \Theta$, there exists a $C_A$ such
that

Esup||ii(0)f < CA^ZiW^ < oo (C.5) 

eeA 

with the second inequality following from L6. 

Regarding the maximum likelihood estimator (MLE), we note that the concentrated average 
log-likelihood is given by 


$$\mathcal{L}_n(y,x;\alpha) = -\frac{d_y}{2}(\log 2\pi + 1) - \frac{1}{2}\log\det\Big[\frac{1}{n}\sum_{i=1}^{n}\xi_i(\alpha)\xi_i(\alpha)^{\mathsf T}\Big] =: L_c(T_n;\alpha),$$

which is three times continuously differentiable in $\alpha$ and $T_n$, so long as $T_n$ is non-singular. By the implicit function theorem, it follows that $\hat\alpha_n$ may be regarded as a smooth function of $T_n$. Noting the usual formula for the ML estimates of $\Sigma_\xi$, this holds also for the components of $\hat\theta_n$ referring to $\Sigma_\xi$, whence

$$\hat\theta_n^m(\beta,\lambda) = h[T_n^m(\beta,\lambda)] \tag{C.6}$$

for some $h$ that is twice continuously differentiable on the set where $T_n^m$ has full rank. Under L7, this occurs uniformly on $B \times \Lambda$ w.p.a.1., and so to avoid tiresome circumlocution, we shall simply treat $h$ as if it were everywhere twice continuously differentiable throughout the sequel. Letting $T(\beta,\lambda) := \mathrm{E}T_n^m(\beta,\lambda)$, we note that the population binding function is given by

$$\theta(\beta,\lambda) = h[T(\beta,\lambda)]. \tag{C.7}$$
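The concentration step used above for the Gaussian auxiliary likelihood can be checked numerically. The sketch below assumes the standard multivariate Gaussian average log-likelihood in terms of a residual matrix. It verifies that plugging in the ML estimate of the error covariance reproduces the concentrated form, and that any other covariance gives a lower value. The residuals are simulated, and all names are illustrative.

```python
import numpy as np

def avg_loglik(resid, Sigma):
    """Average Gaussian log-likelihood of n residual vectors (rows) with covariance Sigma."""
    n, d = resid.shape
    S_hat = resid.T @ resid / n                    # sample second-moment matrix of residuals
    _, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet
            - 0.5 * np.trace(np.linalg.solve(Sigma, S_hat)))

def concentrated_loglik(resid):
    """Average log-likelihood after maximizing over the covariance matrix."""
    n, d = resid.shape
    _, logdet = np.linalg.slogdet(resid.T @ resid / n)
    return -0.5 * d * (np.log(2.0 * np.pi) + 1.0) - 0.5 * logdet

rng = np.random.default_rng(1)
mix = np.array([[1.0, 0.3, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
resid = rng.normal(size=(200, 3)) @ mix            # simulated, correlated residuals

S_hat = resid.T @ resid / resid.shape[0]           # ML estimate of the covariance
full_at_mle = avg_loglik(resid, S_hat)
conc = concentrated_loglik(resid)
full_elsewhere = avg_loglik(resid, 1.5 * S_hat)
```

Because the concentrated criterion depends on the data only through second-moment matrices, the auxiliary estimator inherits its smoothness from those moments, which is the point of writing it as a function `h` of them.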

Define (p^(/3,X) := n^/^[r™(/3. A) — T(/?,A)], and let A)]^^q denote a vector-valued 

continuous Gaussian process on B x A with covariance kernel 


cov(v.-n/3i, Ai), (/? 2 , A 2 )) = cov(rr (/?!, Ai), rr (/32, A 2 )). 

Note that L6, in particular the requirement that E|| 2 :i||^ < 00 , ensures that this covariance exists 
and is finite. 

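Lemma C.1.

(i) $\varphi_n^m(\beta,\lambda) \rightsquigarrow \varphi^m(\beta,\lambda)$ in $\ell^\infty(B \times \Lambda)$, jointly for $m \in \{0,\ldots,M\}$; and

(ii) if (5.7) holds for $l' = l \in \{1,2\}$, then

$$\sup_{\beta\in B}\|\partial_\beta^l T_n^m(\beta,\lambda_n) - \partial_\beta^l T(\beta,0)\| = o_p(1). \tag{C.8}$$

By an application of the delta method, we thus have

Corollary C.1. For $\dot h(\beta,\lambda) := \partial_T h[T(\beta,\lambda)]$,

$$\psi_n^m(\beta,\lambda) := n^{1/2}[\hat\theta_n^m(\beta,\lambda) - \theta(\beta,\lambda)] \rightsquigarrow \dot h(\beta,\lambda)\varphi^m(\beta,\lambda) =: \psi^m(\beta,\lambda) \tag{C.9}$$

in $\ell^\infty(B \times \Lambda)$, jointly for $m \in \{0,\ldots,M\}$.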
The proof of Lemma C.1 appears in Appendix D.

Proof of Proposition 5.1. H1 follows from the twice continuous differentiability of L in (C.1).
The first part of H2 is an immediate consequence of Lemma C.1(i) and the smoothness of L;
the second part is implied by (C.5) and Lemma 2.4 in Newey and McFadden (1994). H3 follows
from Corollary C.l, and immediately entails that, for j3n = /3o + Op(l) and m G {1,...,M}, 
iA(r(/3n, An) = lAn (/3o, 0) + Op(l), where 

1 ” 

V>(r(/3o,0) = - 0(/?o,O)] = -F-i— j;C(/3o,O;0o) + Op(l) 

Tl ' . 

1=1 

for m G {0,1,...,M}; the final equality follows from the consistency of (as implied by 
Corollary C.l) and the arguments used to prove Theorem 3.1 in Newey and McFadden (1994). 
By definition, := ^i”(/^0) 0; 0o) whereupon the rest of H4 follows by the central 

limit theorem, in view of L1 and (C.5). Finally, H5 follows from (C.6), (C.7), Lemma C.1(ii) and
the chain rule. □

D Proof of Lemma C.1

For the purposes of the proofs undertaken in this section, we may suppose without loss of generality that $D = I_{d_y}$ in L2, $\gamma(\beta) = \beta$ in L3, and $\|K\|_\infty \le 1$. Recalling (5.3) above, we have

$$y_r(\beta,\lambda) = \omega_r(\beta)\cdot\prod_{s\in S_r}K_\lambda[\nu_s(\beta)] =: \omega_r(\beta)\cdot\mathbb{K}(S_r;\beta,\lambda). \tag{D.1}$$
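To make the product-of-kernels form in (D.1) concrete, the following sketch evaluates a smoothed choice indicator for shrinking values of the smoothing parameter. The logistic CDF is used as the kernel purely as an assumption for illustration, and the index values and the weight `omega` are made up. As the smoothing parameter goes to zero, the product approaches the indicator that every index is positive.

```python
import numpy as np

def K(x):
    """Logistic CDF, used here as an assumed smoothing kernel (numerically stable form)."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(x, dtype=float)))

def smoothed_indicator(nu, lam, omega=1.0):
    """omega * prod_s K(nu_s / lam): a smooth surrogate for omega * 1{all nu_s > 0}."""
    return float(omega * np.prod(K(np.asarray(nu, dtype=float) / lam)))

# Hypothetical index values: utility differences relative to the competing alternatives.
nu_all_positive = [0.4, 1.2, 0.1]      # alternative would be chosen
nu_one_negative = [0.4, -0.2, 0.1]     # alternative would not be chosen

lams = [1.0, 0.1, 0.01, 0.001]
path_pos = [smoothed_indicator(nu_all_positive, lam) for lam in lams]
path_neg = [smoothed_indicator(nu_one_negative, lam) for lam in lams]
```

For any fixed positive smoothing parameter the function is differentiable in the indices, which is what makes derivative-based optimization of the criterion feasible.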

Let $\dot K$ and $\ddot K$ respectively denote the first and second derivatives of $K$. For future reference, we
here note that


9^y^(/3, A) = •]K(5^;/3, A) + A z.,s-^siSr] I3,\) (D.2) 

S^Sr 

=: Drl{^,\) + \-^Dr2W,\) 

where z^r '■= n^).^, z^r •= and IKs(5; /?, A) := Kx[vs{/5)] ■ ]K(5\{s}; /3, A); and 

djyri/5, A) = A“^ + Zvszlr] ' IKs(5,.; /?, A) (D.3) 

+ X~^Wr{P) ^ ^ ZysZyt ■ KstiSr] /?, A) 

=: X-^Hri{(5,X) + X-^Hr2{P,X) 


for 


Kx[vsif3)]-M'S\{s};(3,X) if s = t, 

Kx[vs{/3)] ■ Kx[vt{/3)] ■ K{S\{s,t};P,X) if s / L 
D.1 Proof of part (ii)

In view of (C.2), the scalar elements of $T_n(\beta,\lambda)$ that depend on $(\beta,\lambda)$ take either of the following forms:

$$\tau_{n1}(\beta,\lambda) := E_n[y_r(\beta,\lambda)y_s(\beta,\lambda)] \qquad \tau_{n2}(\beta,\lambda) := E_n[y_r(\beta,\lambda)x_t] \tag{D.4}$$

for some $r,s \in \{1,\ldots,d_y\}$, or $t \in \{1,\ldots,d_x\}$, where $E_n f(\beta,\lambda) := \frac{1}{n}\sum_{i=1}^{n}f(z_i;\beta,\lambda)$. (Throughout the following, all statements involving $r$, $s$ and $t$ should be interpreted as holding for all possible values of these indices.) For $k \in \{1,2\}$ and $l \in \{0,1,2\}$, define $\tau_k(\beta,\lambda) := \mathrm{E}\tau_{nk}(\beta,\lambda)$ - a typical scalar element of $T(\beta,\lambda)$ - and $\tau_k^{[l]}(\beta,\lambda) := \mathrm{E}\partial_\beta^l\tau_{nk}(\beta,\lambda)$. Thus part (ii) of Lemma C.1 will follow once we have shown that

$$\partial_\beta^l\tau_{nk}(\beta,\lambda_n) = \tau_k^{[l]}(\beta,\lambda_n) + o_p(1) = \partial_\beta^l\tau_k(\beta,0) + o_p(1) \tag{D.5}$$

uniformly in $\beta \in B$. The second equality in (D.5) is implied by

Lemma D.1. $\tau_k^{[l]}(\beta,\lambda_n) \overset{p}{\to} \partial_\beta^l\tau_k(\beta,0)$, uniformly on $B$, for $k \in \{1,2\}$ and $l \in \{0,1,2\}$.

The proof appears at the end of this section. We turn next to the first equality in (D.5). We require the following definitions. A function $F : \mathcal{Z} \to \mathbb{R}$ is an envelope for the class $\mathcal{F}$ if $\sup_{f\in\mathcal{F}}|f(z)| \le F(z)$. For a probability measure $Q$ and a $p \in (1,\infty)$, let $\|f\|_{p,Q} := (E_Q|f(z_i)|^p)^{1/p}$. $\mathcal{F}$ is Euclidean for the envelope $F$ if

$$\sup_Q N(\epsilon\|F\|_{1,Q}, \mathcal{F}, L_{1,Q}) \le C_1\epsilon^{-C_2}$$

for some $C_1$ and $C_2$ (depending on $\mathcal{F}$), where $N(\epsilon,\mathcal{F},L_{1,Q})$ denotes the minimum number of $L_{1,Q}$-balls of diameter $\epsilon$ needed to cover $\mathcal{F}$. For a parametrized family of functions $g(\beta,\lambda) = g(z;\beta,\lambda)$, let $\mathcal{F}(g) := \{g(\beta,\lambda) \mid (\beta,\lambda) \in B \times \Lambda\}$. Since $B$ is compact, we may suppose without loss of generality that $B \subseteq \{\beta \in \mathbb{R}^{d_\beta} \mid \|\beta\| \le 1\}$, whence recalling (5.2) and (5.4) above,


Wr{z] /3)\ < Wr < 



if r G {1,.. .dw} 
if r G {dw + 1,. ..dy}. 


Thus by Lemma 22 in Nolan and Pollard (1987) 


Di for L G {K, Ks, Kst}, s, t G {1,..., dy} and 5 C {1,..., d^}, the class 


F{E, S) := {L(5; /3, A) | (/3, A) G B x A} 


is Euclidean with constant envelope; and 
D2 for r G {1,..., dy}, F{wr) is Euclidean for Wr¬ 
it therefore follows by a slight adaptation of the proof of Theorem 9.15 in Kosorok (2008) that 
D3 F{yr) is Euclidean for Wr', 

D4 F{yrDsi) and F{yrDs2) are Euclidean for lL}.lFs|| 2 ;|| 




D5 F{xtDsi) and F{xtDs 2 ) are Euclidean for lEsllzp; 

D6 F{DsiDJ^)^ F{DsiDJ2)^ F{Ds2DIi) and F{Ds2DJ2) are Euclidean for lE,.P14||z|p; 

D7 F{ysHri) and F{ysHr 2 ) are Euclidean for and 

D8 F{xtHri) and F{xtHr 2 ) are Euclidean for lE^Hzlp. 

Let $\mu_n f := \frac{1}{n}\sum_{i=1}^{n}[f(z_i) - \mathrm{E}f(z_i)]$. Using the preceding facts, and the uniform law of large numbers given as Proposition E.1 below, we may prove

Lemma D.2. The convergence

$$\sup_{\beta\in B}\|\mu_n\partial_\beta^l[y_s(\beta,\lambda_n)y_r(\beta,\lambda_n)]\| + \sup_{\beta\in B}\|\mu_n x_t\partial_\beta^l y_r(\beta,\lambda_n)\| = o_p(1) \tag{D.6}$$

holds for $l = 0$, and also for $l \in \{1,2\}$ if (5.7) holds with $l' = l$.

The first equality in (C.8) now follows, and thus part (ii) of Lemma C.1 is proved.

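Proof of Lemma D.1. Suppose $l = 2$; the proof when $l = 1$ is analogous (and is trivial when $l = 0$). Noting that

$$\partial_\beta^2(y_r y_s) = y_s\partial_\beta^2 y_r + (\partial_\beta y_r)(\partial_\beta y_s)^{\mathsf T} + (\partial_\beta y_s)(\partial_\beta y_r)^{\mathsf T} + y_r\partial_\beta^2 y_s, \tag{D.7}$$

it follows from (D.2), (D.3), D6 and D7 that for every $\lambda \in (0,1]$,

$$\|\partial_\beta^2(y_r y_s)\| \lesssim \lambda^{-2}W_r W_s(\|z\|^2 \vee 1),$$

which does not depend on $\beta$, and is integrable by L6. (Here $a \lesssim b$ denotes that $a \le Cb$ for some constant $C$ not depending on $b$.) Thus by the dominated derivatives theorem, the second equality in

$$\tau_1^{[2]}(\beta,\lambda) = \mathrm{E}\partial_\beta^2\tau_{n1}(\beta,\lambda) = \partial_\beta^2\mathrm{E}\tau_{n1}(\beta,\lambda) = \partial_\beta^2\tau_1(\beta,\lambda)$$

holds for every $\lambda \in (0,1]$; the other equalities follow from the definitions of $\tau_1^{[2]}$ and $\tau_1$. Deduce that, so long as $\lambda_n > 0$ (as per the requirements of Proposition 5.1 above),

$$\tau_1^{[2]}(\beta,\lambda_n) = \partial_\beta^2\tau_1(\beta,\lambda_n) \overset{p}{\to} \partial_\beta^2\tau_1(\beta,0)$$

by the uniform continuity of $\partial_\beta^2\tau_1$ on $B \times \Lambda$. A similar reasoning - but now using D8 - gives the same result for $\tau_2^{[2]}$. □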

The proof of Lemma D.2 requires the following result. Let Qui^x denote the it- field generated 
by Vujizi) and x{zi), and let rjiy denote those elements of rj that are not present in Recall that 
rju JL Qw,X‘ 

Lemma D.3. For every p G {0,1, 2}, s, t G {1,..., d^}, S C {1,..., d^} and L G {K^, 

E[\\z,s\nz,trUS; /?, A)2 I < XE[\\z,s\nz,tr \ ^- 4 . (D.S) 
Proof. Note that for any L G 

US-/3,X)<L^[uM] 

where L{x) = max{|i^(a:)|, \K{x)\}. Let d denote the dimensionality of r]^, and fix a /3 G B. By 
L4 and L5, there is a /c G {1,... d}, possibly depending on (3, and an e > 0 which does not, such 
that 

Ml3) = v*{P) + l^kdvk 

with 1/3^1 > e and r'*(/3) X r]uk- Let ^* 3 , := Qui,x V CF{{qyi}i^k), so that vl{l3) is ^.-measurable, 
and let fk denote the density of rj^k- Then for any q G {0,..., 4}, 

E [\7j,kms;/3,xf I g:,,] < E [M'^Ll{u:{/3) + l3*kVuk) I 

= [\u\‘^Ll{iy*{(3) +f3*ku)fk{u)du 

Jr 

^iPk)~^>^[ L‘^{u)du-sup\u\'^fk{u) 

Jr msk 

< e-'A, (D.9) 

since sup„gR|u|'?/fc(u) < 00 under L4. Finally, we may partition z,ys = {zts^duk)'^ and z,yt = 
{z*J, rjiyk)'^, with the possibility that z,ys = zts and Zi^t = ztt- Then by (D.9), 

E [\\z.snz.trus-p,xf I g:,,] < x\\zt,r\\zt,r < x\\z.sr\M^. 

The result now follows by the law of iterated expectations. □ 
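The order-$\lambda$ bound that drives this lemma comes from a change of variables, and can be illustrated numerically in a simplified setting. The sketch below assumes a logistic kernel and a standard normal index (both hypothetical choices, with no conditioning), and evaluates the second moment of the kernel derivative at the rescaled index by a Riemann sum. For this particular kernel the integral of the squared derivative is 1/6, so the moment is bounded by $\lambda$ times the peak of the normal density divided by 6.

```python
import numpy as np

def K_dot(x):
    """Derivative of the logistic CDF, K(x) * (1 - K(x))."""
    k = 0.5 * (1.0 + np.tanh(0.5 * x))
    return k * (1.0 - k)

def second_moment(lam, half_width=10.0, n_grid=2_000_001):
    """Riemann-sum approximation of E[K_dot(v / lam)^2] for v ~ N(0, 1)."""
    v = np.linspace(-half_width, half_width, n_grid)
    dv = v[1] - v[0]
    density = np.exp(-0.5 * v ** 2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(K_dot(v / lam) ** 2 * density) * dv)

lams = [0.5, 0.1, 0.02]
moments = [second_moment(lam) for lam in lams]

# Bound implied by the change of variables u = v / lam:
# E[K_dot(v/lam)^2] <= lam * sup(density) * integral of K_dot^2 = lam / (6 * sqrt(2 * pi)).
bound_const = 1.0 / (6.0 * np.sqrt(2.0 * np.pi))
ratios = [m / lam for m, lam in zip(moments, lams)]
```

The ratios stay below the constant and approach it as the smoothing parameter shrinks, which is the sense in which squared kernel derivatives contribute a factor of order $\lambda$ rather than order one.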

Proof of Lemma D.2. We shall only provide the proof for the first term on the left side of (D.6),
when $l = 2$; the proofs in all other cases are analogous, requiring appeal only to Proposition E.1
(or Theorem 2.4.3 in van der Vaart and Wellner, 1996, when $l = 0$) and the appropriate parts of
D3-D8.

Recalling the decomposition of d'piyrys) given in (D.7) above, we are led to consider 

ipl3yr){dj3ys)^ = DsiDh + X ^Ds2D1i + X ^DsiD^2 + A ^Ds2L)^2 (D.IO) 

and 

ysd^yr = X~^ysHri X~‘^ysHr2- (DTI) 

Note that by Lemma D.3, and L6 

E||2/,iL,2f <E EE E[||z,,f||z,i||2|K,t(5,;/3,A)|2 | 

S^Sr 

<AE EE 

< A 

and analogously for each of Hri, DgiDj^, Ds 2 DJ^, DsiDJ 2 and Ds 2 DJ 2 - By D6 and D7, the classes 
formed from these parametrized functions are Euclidean, with envelopes that are $p_0$-integrable
under L6 ($p_0 > 2$).

Application of Proposition E.1 to each of the terms in D6 and D7, with $\lambda$ playing the role of
$\delta^{-1}$ there, thus yields the result. Negligibility of the final terms in (D.10) and (D.11) entails the
most stringent conditions on the rate at which $\lambda_n$ may shrink to zero, due to the multiplication
of these by $\lambda^{-2}$. □

D.2 Proof of part (i)

The typical scalar elements of $T_n$ are as displayed in (D.4) above, i.e. they are averages of random functions of the form $\zeta_1(\beta,\lambda) := y_r(\beta,\lambda)y_s(\beta,\lambda)$ or $\zeta_2(\beta,\lambda) := x_t y_r(\beta,\lambda)$, for $r,s \in \{1,\ldots,d_y\}$ and $t \in \{1,\ldots,d_x\}$. It follows from D3 that $\mathcal{F}(\zeta_1)$ and $\mathcal{F}(\zeta_2)$ are Euclidean, with envelopes $F_1 := W_r W_s$ and $F_2 := \|z\|W_r$ respectively. Since both envelopes are square integrable under L6, we have

$$\sup_Q N(\epsilon\|F_k\|_{2,Q}, \mathcal{F}(\zeta_k), L_{2,Q}) \le C_1\epsilon^{-C_2}$$

for $k \in \{1,2\}$. Hence (C.9) follows by Theorem 2.5.2 in van der Vaart and Wellner (1996).

E A uniform-in-bandwidth law of large numbers

This section provides a uniform law of large numbers (ULLN) for certain classes of parametrized functions, broad enough to cover products involving $K_\lambda[\nu_s(\beta)]$, and such generalizations as appear in Lemma D.3 above. Our ULLN holds uniformly in the inverse ‘bandwidth’ parameter $\delta = \lambda^{-1}$; in this respect, it is related to some of the results proved in Einmahl and Mason (2005). However, while their arguments could be adapted to our problem, these would lead to stronger conditions on the bandwidth: in particular, $p$ would have to be replaced by $2p$ in Proposition E.1 below. (On the other hand, their results yield explicit rates of uniform convergence, which are not of concern here.)

Consider the (pointwise measurable) function class 

Ea := {z f(^^s)iz) I (7,'5) G P X A}, 

and put E := E[i^oo)- The functions : .2" —)• satisfy: 

El sup.^grE||/(..^_5)(2;o)|P < for every d > 0. 

Let E : —)• M denote an envelope for E, in the sense that 

sup \\f{y,s){z)\\ < F{z) 

{'y,S)^Tx [ 1 , 00 ) 

for all 2 ; G We will suppose that E may be chosen such that, additionally, 

E 2 E|E(2;o)|^ < 00 ; and 

E 3 supQ A'(e||E||i^Q, E, Li^o) < Ce~‘^ for some d G (0, 00 ). 

Let {(5n} denote a real sequence with 5n > 1, and A„ := [1, <!„]. 
Proposition E.l. Under ei-E3, if ^log((5n V n) —)• oo for some m>l, then 

sup 6^ynf(^,S)\\ = Op{l). (E.l) 

(7,5)erx A„ 

Remark E.l. Suppose 5n is an J^-measurable sequence for which log(5n V n) ^ oo. 

Then for every e > 0, there exists a deterministic sequence {(5n} satisfying the requirements of 
Proposition E.l, and for which limsup^^o^ < ^n} > 1 — e. Deduce that 

SUPdlMn/(7,5„)|| = Op(l). 

7er 

The proof requires the following 

Lemma E.1. Suppose $\mathcal{F}$ is a (pointwise measurable) class with envelope $F$, satisfying

(i) $\|F\|_\infty \le \tau$;

(ii) $\sup_{f\in\mathcal{F}}\|f\|_{2,P} \le \sigma$; and

(iii) $\sup_Q N(\epsilon\|F\|_{1,Q}, \mathcal{F}, L_{1,Q}) \le C\epsilon^{-d}$.

Let 6 := m G N and x > 0. Then there exist Ci, (72 G (0, oo), not depending on r, a or 

X, sueh that 



P < sup - /(” 5)]| = 0 I < P max|F(2:i)| > i = o(l). 

[(7,<5)erxA„ J I J 

It thus suffices to show that (E.l) holds when is replaced by Since ei and E3 continue 

to hold after this replacement, it suffices to prove (E.l) when E 2 is replaced by the condition that 
||P"||oo < which shall be maintained throughout the sequel. (The dependence of / and F 

upon n will be suppressed for notational convenience.) 

Letting 4 := e^, define Ank ■= [4,4+1 A 4] for k G {0, ...,iPn}, where iP„ := log4; 
observe that A„ = \Jk=o^nk- Set 

nk — ^ I ^ r X 

and note that ||E||oo < and supjgj-^j, ||/|| 2 ,p < 4^^^- Under E3, we may apply apply 

Lemma E.l to each Fnki with (r, o") = (n^/^,4 ^^^) ^ some e > 0. There thus 
exist Cl, 6*2 G ( 0 , oo) depending on e such that 

r 'i r 


sup > e ^ < 5^ sup \l^nf('y,S) I 


>e-i6 


(7,<5)erxA. 


fc =0 


(7.i5)erxA„fe 

< Cl exp[-C2n02,5f—) + dlog{e-^d]:-^)] (E.3) 


Kr^ 


k=0 


where 9nk •= n provided 


yke{0,...,Kn} 


n 




(E.4) 


which holds for all n sufficiently large. In obtaining (E.4) we have used 5k < 5n and 9nk ^ 
and these further imply that (E.3) may be bounded by 


Ci(log(J„) exp[-C 2 n^ + e^) + (ilog((5™n^/^)]0 


as n —)■ 00 . Thus (E.l) holds. □ 

Proof of Lemma E.l. Suppose (hi) holds. Define Q := {r“^/ | / G T}, and G := t~^F. Then 

sup|| 5 r|| 2 ,p < sup||/|| 2 ,p < =: 9] 

geg far 

Ilfl'Iloo < 1 for all g G G] and since ||Cn||i,Q < 1, A^(e, ^,Li^q) < Ce~‘^. Hence, by arguments 
given in the proof of Theorem 11.37 in Pollard (1984), there exist Ci,C 2 > 0, depending on x, 
such that 


P 



sup\ gnf\ > 
/6.F 



sup|/r„ 5 r| > 9 
g&g 



< Cl exp[—C 2 n 0 ^(l + x^) + dlog (0 ^)] 


for all n > |x ^9 


□ 
F Index of key notation

Greek and Roman symbols

Listed in (Roman) alphabetical order. Greek symbols are listed according to their English names:
thus $\Omega$, as ‘omega’, appears before $\theta$, as ‘theta’.

/?, /3o, B structural model parameters, true value, parameter space. Sec. 2 

GII estimator; near-minimizer of . Sec. 5.3 

terminal value for routine r started at . (5.15) 

Cn tuning sequence in the definition of . Sec. 5.5 

djs, de, ... dimensionality of /3, 6, etc. Sec. 3.1 

Enf sample average, ^Ya=i fi^i) . App. D.l 

rjit stochastic components of the structural model . Sec. 2 

R cj-field supporting all observed and simulated variates . Sec. 5.1 

G Jacobian of the population binding function . Sec. 5.3 

Guild) Jacobian of the smoothed sample binding function . Rem. 5.15 

7(/3) (re-)parametrizes the structural model. (5-2) 

'jrk jackknihng weights . (4-3) 

H auxiliary model (population) log-likelihood Hessian . Sec. 5.3 

J total number of alternatives . Sec. 2 

k order of jackknifing (unless otherwise defined) . R6 

ko maximum order (less 1) of differentiability of /3 i—)■ 0(/3, A) . R3 

K, Kx smoothing kernel, Kx{x) ;= iL(A“^x) . (5-3) 

K, product of kernel-type functions . App. E 

iiyi,Xi-,9) ith contribution to auxiliary model log-likelihood . (3-1) 

£i/3,X;6) abbreviates i{yi{l3,X),xf, 6) Sec. 5.1 

i°°iD) space of bounded functions on the set D . H3 

Cn{y,x',6) auxiliary model average log-likelihood . (3.1) 

A, A smoothing parameter, set of allowable values . Sec. 3.3 

m indexes the simulated dataset; m = 0 denotes the data . Sec. 3.1 

M total number of simulations . Sec. 3.1 

fJ^nf centered sample average, 4 . App. D.l 

n total number of individuals . Sec. 2 

A^(0,e) open ball of radius e centered at 9 . App. B.l 

Uriz] (3) linear index in structural model . (5.2a) 

ujr{z',(3) linear index in structural model . (5.2a) 

0,(17, V) variance matrix function . (5.10) 

Pq order of moments possessed by model variates . L6 

(/>™, standardized auxiliary sample score and its weak limit . (3-5) 
■0™, 0"^ centered auxiliary estimator process and its weak limit . H3 

sample criterion for estimator e (jackknifed) . Sec. 4.2 

(5| large-sample (unsmoothed) limit of note . Sec. 4.2 

R auxiliary model score covariance, Ei™(0o)^(”\^o)''' for rn! ^ m (5.6) 

near-roots of exact roots of Q® . Sec. 5.5 

i?min(^) smallest eigenvalue of symmetric matrix A . Sec. 5.1 

^rik-: subset of R^j^, satisfying second-order conditions . (5.13) 

o'min(.B) smallest singular value of matrix B . Sec. 5.6 

S auxiliary model score variance, E£("(0o)f*™(0o)'''. (5-6) 

T total number of time periods . Sec. 2 

9, 0 auxiliary model parameters, parameter space . Sec. 3.1 

00 pseudo-true parameters implied by /3o . Sec. 3.1 

9n data-based estimate of 0 . Sec. 3.1 

6^{/3,X) simulation-based estimate of 0. (3.3) 

9^{/3,X) population binding function (smoothed, jackknifed) . (4.3) 

—k 

9^{/3,X) sample binding function (smoothed, jackknifed) . (4.4) 

Uitj utility of individual i from alternative j in period t . Sec. 2 

u^j{j3) simulated utilities at /? . Sec. 3.3 

Ue “Hessian” component of limiting variance . (5-11) 

Ve “score” component of limiting variance. (5-11) 

w.p.a.l with probability approaching one . Thm. 5.3 

Wr{z) envelope for a;,.( 2 ;;/?) . Sec. 5.1 

VFn, W Wald weighting matrix and its probability limit . Sec. 3.1 

Xit exogenous covariates for individual i in period t . Sec. 2 

Uitj set = 1 if individual i chooses j in period t . Sec. 2 

y'^j{l3,X) smoothed simulated choice indicators at /3 . Sec. 3.3 

collects Xi and r/™ . Sec. 5.1 


Symbols not connected to Greek or Roman letters 

Ordered alphabetically by their description. 


weak convergence (van der Vaart and Wellner, 1996) . H3 

—)• convergence in probability . Sec. 5 

||x||, IIxIIa Euclidean norm, H-weighted norm of X . Sec. 3.1 

/, / gradient, hessian of / . Sec. 5.3 

d^f, d‘pf gradient, hessian of / w.r.t. (3 . Rem. 5.1 

< left side bounded by the right side times a constant . App. D.l 

||/||p,Q AP(Q) norm of /, i.e. {Eq\ f{zi)\P)^/P . App. D.l 